ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Stavros Kontopoulos	39577a27a0	[SPARK-24902][K8S] Add PV integration tests ## What changes were proposed in this pull request? - Adds persistent volume integration tests - Adds a custom tag to the test to exclude it if it is run against a cloud backend. - Assumes default fs type for the host, AFAIK that is ext4. ## How was this patch tested? Manually run the tests against minikube as usual: ``` [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 192 milliseconds. Run starting. Expected test count is: 16 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - Test PVs with local storage ``` Closes #23514 from skonto/pvctests. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-03-27 13:00:56 -07:00
Ruocheng Jiang	d6ee2f331d	[MINOR][EXAMPLES] Add missing return keyword streaming word count example This is a very low level error. Closes #24153 from jiangruocheng/master. Authored-by: Ruocheng Jiang <jiangruocheng@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:59:12 -05:00
Yuming Wang	d70b6a39e1	[MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) ## What changes were proposed in this pull request? This pr adds 2 maven properties to help us upgrade the built-in Hive. \| Property Name \| Default \| In future \| \| ------ \| ------ \| ------ \| \| hive.classifier \| (none) \| core \| \| hive.parquet.group \| com.twitter \| org.apache.parquet \| ## How was this patch tested? existing tests Closes #23996 from wangyum/add_2_maven_properties. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-07 16:46:07 -06:00
masa3141	5fa4ba0cfb	[SPARK-26981][MLLIB] Add 'Recall_at_k' metric to RankingMetrics ## What changes were proposed in this pull request? Add 'Recall_at_k' metric to RankingMetrics ## How was this patch tested? Add test to RankingMetricsSuite. Closes #23881 from masa3141/SPARK-26981. Authored-by: masa3141 <masahiro@kazama.tv> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-06 08:28:53 -06:00
liuxian	8290e5eccb	[SPARK-26353][SQL] Add typed aggregate functions(max/min) to the example module. ## What changes were proposed in this pull request? Add typed aggregate functions(max/min) to the example module. ## How was this patch tested? Manual testing: running typed minimum: ``` +-----+----------------------+ \|value\|TypedMin(scala.Tuple2)\| +-----+----------------------+ \| 0\| [0.0]\| \| 2\| [2.0]\| \| 1\| [1.0]\| +-----+----------------------+ ``` running typed maximum: ``` +-----+----------------------+ \|value\|TypedMax(scala.Tuple2)\| +-----+----------------------+ \| 0\| [18]\| \| 2\| [17]\| \| 1\| [19]\| +-----+----------------------+ ``` Closes #23304 from 10110346/typedminmax. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-18 17:20:58 +08:00
Wenchen Fan	8656af98c0	[SPARK-26861][SQL] deprecate typed sum/count/average ## What changes were proposed in this pull request? These builtin typed aggregate functions are not very useful: 1. users can just call the untyped ones and turn the resulting dataframe to a dataset. It has better performance. 2. the typed aggregate functions have subtle different behaviors regarding empty input. I think we should get rid of these builtin typed agg functions and suggest users to use the untyped ones. However, these functions are still useful as a demo of the `Aggregator` API, so I copied them to the example module. ## How was this patch tested? N/A Closes #23763 from cloud-fan/example. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-02-14 16:54:39 -08:00
Maxim Gekk	a829234df3	[SPARK-26817][CORE] Use System.nanoTime to measure time intervals ## What changes were proposed in this pull request? In the PR, I propose to use `System.nanoTime()` instead of `System.currentTimeMillis()` in measurements of time intervals. `System.currentTimeMillis()` returns current wallclock time and will follow changes to the system clock. Thus, negative wallclock adjustments can cause timeouts to "hang" for a long time (until wallclock time has caught up to its previous value again). This can happen when ntpd does a "step" after the network has been disconnected for some time. The most canonical example is during system bootup when DHCP takes longer than usual. This can lead to failures that are really hard to understand/reproduce. `System.nanoTime()` is guaranteed to be monotonically increasing irrespective of wallclock changes. ## How was this patch tested? By existing test suites. Closes #23727 from MaxGekk/system-nanotime. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-13 13:12:16 -06:00
Sean Owen	8171b156eb	[SPARK-26771][CORE][GRAPHX] Make .unpersist(), .destroy() consistently non-blocking by default ## What changes were proposed in this pull request? Make .unpersist(), .destroy() non-blocking by default and adjust callers to request blocking only where important. This also adds an optional blocking argument to Pyspark's RDD.unpersist(), which never had one. ## How was this patch tested? Existing tests. Closes #23685 from srowen/SPARK-26771. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-01 18:29:55 -06:00
Huaxin Gao	f7d87b1685	[SPARK-25997][ML] add Python example code for Power Iteration Clustering in spark.ml ## What changes were proposed in this pull request? Add python example for Power Iteration Clustering in spark.ml ## How was this patch tested? Manually tested Closes #22996 from huaxingao/spark-25997. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-31 19:33:44 -06:00
Sean Owen	c2d0d700b5	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis ## What changes were proposed in this pull request? Misc code cleanup from lgtm.com analysis. See comments below for details. ## How was this patch tested? Existing tests. Closes #23571 from srowen/SPARK-26640. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-17 19:40:39 -06:00
Kazuaki Ishizaki	79b05481a2	[SPARK-26508][CORE][SQL] Address warning messages in Java reported at lgtm.com ## What changes were proposed in this pull request? This PR addresses warning messages in Java files reported at [lgtm.com](https://lgtm.com). [lgtm.com](https://lgtm.com) provides automated code review of Java/Python/JavaScript files for OSS projects. [Here](https://lgtm.com/projects/g/apache/spark/alerts/?mode=list&severity=warning) are warning messages regarding Apache Spark project. This PR addresses the following warnings: - Result of multiplication cast to wider type - Implicit narrowing conversion in compound assignment - Boxed variable is never null - Useless null check NOTE: `Potential input resource leak` looks false positive for now. ## How was this patch tested? Existing UTs Closes #23420 from kiszk/SPARK-26508. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-01 22:37:28 -06:00
Alessandro Bellina	0a02d5c36f	[SPARK-26285][CORE] accumulator metrics sources for LongAccumulator and Doub… …leAccumulator ## What changes were proposed in this pull request? This PR implements metric sources for LongAccumulator and DoubleAccumulator, such that a user can register these accumulators easily and have their values be reported by the driver's metric namespace. ## How was this patch tested? Unit tests, and manual tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23242 from abellina/SPARK-26285_accumulator_source. Lead-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Co-authored-by: Alessandro Bellina <abellina@oath.com> Co-authored-by: Alessandro Bellina <abellina@gmail.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2018-12-22 09:03:02 -06:00
Keiji Yoshida	e408e05322	[MINOR][DOCS] Fix the "not found: value Row" error on the "programmatic_schema" example ## What changes were proposed in this pull request? Print `import org.apache.spark.sql.Row` of `SparkSQLExample.scala` on the `programmatic_schema` example to fix the `not found: value Row` error on it. ``` scala> val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim)) <console>:28: error: not found: value Row val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim)) ``` ## How was this patch tested? NA Closes #23326 from kjmrknsn/fix-sql-getting-started. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-12-16 17:11:58 -08:00
Sean Owen	79e36e2c2a	[SPARK-19827][R][FOLLOWUP] spark.ml R API for PIC ## What changes were proposed in this pull request? Follow up style fixes to PIC in R; see #23072 ## How was this patch tested? Existing tests. Closes #23292 from srowen/SPARK-19827.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-12 09:03:13 -06:00
Huaxin Gao	05cf81e6de	[SPARK-19827][R] spark.ml R API for PIC ## What changes were proposed in this pull request? Add PowerIterationCluster (PIC) in R ## How was this patch tested? Add test case Closes #23072 from huaxingao/spark-19827. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-10 18:28:13 -06:00
Huaxin Gao	678e1aca69	[SPARK-24207][R] follow-up PR for SPARK-24207 to fix code style problems ## What changes were proposed in this pull request? follow-up PR for SPARK-24207 to fix code style problems Closes #23256 from huaxingao/spark-24207-cnt. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-08 22:23:50 +08:00
Liang-Chi Hsieh	8bfea86b1c	[SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder ## What changes were proposed in this pull request? We have deprecated `OneHotEncoder` at Spark 2.3.0 and introduced `OneHotEncoderEstimator`. At 3.0.0, we remove deprecated `OneHotEncoder` and rename `OneHotEncoderEstimator` to `OneHotEncoder`. TODO: According to ML migration guide, we need to keep `OneHotEncoderEstimator` as an alias after renaming. This is not done at this patch in order to facilitate review. ## How was this patch tested? Existing tests. Closes #23100 from viirya/remove_one_hot_encoder. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-29 01:54:06 +00:00
DB Tsai	ad853c5678	[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 ## What changes were proposed in this pull request? This PR makes Spark's default Scala version as 2.12, and Scala 2.11 will be the alternative version. This implies that Scala 2.12 will be used by our CI builds including pull request builds. We'll update the Jenkins to include a new compile-only jobs for Scala 2.11 to ensure the code can be still compiled with Scala 2.11. ## How was this patch tested? existing tests Closes #22967 from dbtsai/scala2.12. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-14 16:22:23 -08:00
Sean Owen	722369ee55	[SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in JDK11 …. Other related changes to get JDK 11 working, to test ## What changes were proposed in this pull request? - Access `sun.misc.Cleaner` (Java 8) and `jdk.internal.ref.Cleaner` (JDK 9+) by reflection (note: the latter only works if illegal reflective access is allowed) - Access `sun.misc.Unsafe.invokeCleaner` in Java 9+ instead of `sun.misc.Cleaner` (Java 8) In order to test anything on JDK 11, I also fixed a few small things, which I include here: - Fix minor JDK 11 compile issues - Update scala plugin, Jetty for JDK 11, to facilitate tests too This doesn't mean JDK 11 tests all pass now, but lots do. Note also that the JDK 9+ solution for the Cleaner has a big caveat. ## How was this patch tested? Existing tests. Manually tested JDK 11 build and tests, and tests covering this change appear to pass. All Java 8 tests should still pass, but this change alone does not achieve full JDK 11 compatibility. Closes #22993 from srowen/SPARK-24421. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-14 12:52:54 -08:00
Marco Gaido	0b59170001	[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator ## What changes were proposed in this pull request? Using `computeCost` for evaluating a model is a very poor approach. We should advice the users to a better approach which is available, ie. using the `ClusteringEvaluator` to evaluate their models. The PR updates the examples for `BisectingKMeans` in order to do that. ## How was this patch tested? running examples Closes #22786 from mgaido91/SPARK-25764. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-05 22:42:04 +00:00
Dongjoon Hyun	4506dad8a9	[SPARK-25656][SQL][DOC][EXAMPLE] Add a doc and examples about extra data source options ## What changes were proposed in this pull request? Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](https://github.com/apache/spark/pull/22622#discussion_r222911529), this PR aims to add more detailed information and examples ## How was this patch tested? Manual. Closes #22801 from dongjoon-hyun/SPARK-25656. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-23 12:41:20 -07:00
Dongjoon Hyun	3b4556745e	[SPARK-25795][R][EXAMPLE] Fix CSV SparkR SQL Example ## What changes were proposed in this pull request? This PR aims to fix the following SparkR example in Spark 2.3.0 ~ 2.4.0. ```r > df <- read.df("examples/src/main/resources/people.csv", "csv") > namesAndAges <- select(df, "name", "age") ... Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_c0];; 'Project ['name, 'age] +- AnalysisBarrier +- Relation[_c0#97] csv ``` - https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.2/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.1/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options ## How was this patch tested? Manual test in SparkR. (Please note that `RSparkSQLExample.R` fails at the last JDBC example) ```r > df <- read.df("examples/src/main/resources/people.csv", "csv", sep=";", inferSchema=T, header=T) > namesAndAges <- select(df, "name", "age") ``` Closes #22791 from dongjoon-hyun/SPARK-25795. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-22 16:34:33 -07:00
Huaxin Gao	fc64e83f95	[SPARK-24207][R] add R API for PrefixSpan ## What changes were proposed in this pull request? add R API for PrefixSpan ## How was this patch tested? add test in test_mllib_fpm.R Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21710 from huaxingao/spark-24207.	2018-10-21 12:32:43 -07:00
Wenchen Fan	4acbda4a96	Revert "[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator" This reverts commit `d0ecff2854`.	2018-10-20 09:28:53 +08:00
Marco Gaido	d0ecff2854	[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator ## What changes were proposed in this pull request? The PR updates the examples for `BisectingKMeans` so that they don't use the deprecated method `computeCost` (see SPARK-25758). ## How was this patch tested? running examples Closes #22763 from mgaido91/SPARK-25764. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-19 09:33:46 +08:00
Sean Owen	703e6da1ec	[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration ## What changes were proposed in this pull request? Remove Kafka 0.8 integration ## How was this patch tested? Existing tests, build scripts Closes #22703 from srowen/SPARK-25705. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-16 09:10:24 -05:00
Ilan Filonenko	6c9c84ffb9	[SPARK-23257][K8S] Kerberos Support for Spark on K8S ## What changes were proposed in this pull request? This is the work on setting up Secure HDFS interaction with Spark-on-K8S. The architecture is discussed in this community-wide google [doc](https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg) This initiative can be broken down into 4 Stages STAGE 1 - [x] Detecting `HADOOP_CONF_DIR` environmental variable and using Config Maps to store all Hadoop config files locally, while also setting `HADOOP_CONF_DIR` locally in the driver / executors STAGE 2 - [x] Grabbing `TGT` from `LTC` or using keytabs+principle and creating a `DT` that will be mounted as a secret or using a pre-populated secret STAGE 3 - [x] Driver STAGE 4 - [x] Executor ## How was this patch tested? Locally tested on a single-noded, pseudo-distributed Kerberized Hadoop Cluster - [x] E2E Integration tests https://github.com/apache/spark/pull/22608 - [ ] Unit tests ## Docs and Error Handling? - [x] Docs - [x] Error Handling ## Contribution Credit kimoonkim skonto Closes #21669 from ifilonenko/secure-hdfs. Lead-authored-by: Ilan Filonenko <if56@cornell.edu> Co-authored-by: Ilan Filonenko <ifilondz@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-15 15:48:51 -07:00
Sean Owen	a001814189	[SPARK-25598][STREAMING][BUILD][TEST-MAVEN] Remove flume connector in Spark 3 ## What changes were proposed in this pull request? Removes all vestiges of Flume in the build, for Spark 3. I don't think this needs Jenkins config changes. ## How was this patch tested? Existing tests. Closes #22692 from srowen/SPARK-25598. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-11 14:28:06 -07:00
gatorsmile	9bf397c0e4	[SPARK-25592] Setting version to 3.0.0-SNAPSHOT ## What changes were proposed in this pull request? This patch is to bump the master branch version to 3.0.0-SNAPSHOT. ## How was this patch tested? N/A Closes #22606 from gatorsmile/bump3.0. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-02 08:48:24 -07:00
gatorsmile	bb2f069cf2	[SPARK-25436] Bump master branch version to 2.5.0-SNAPSHOT ## What changes were proposed in this pull request? In the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`. ## How was this patch tested? N/A Closes #22426 from gatorsmile/bumpVersionMaster. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-15 16:24:02 -07:00
Ilan Filonenko	1cfda44825	[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S ## What changes were proposed in this pull request? Add spark.executor.pyspark.memory limit for K8S ## How was this patch tested? Unit and Integration tests Closes #22298 from ifilonenko/SPARK-25021. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Holden Karau <holden@pigscanfly.ca>	2018-09-08 22:18:06 -07:00
Huangweizhe	6c66ab8b33	[SPARK-24688][EXAMPLES] Modify the comments about LabeledPoint ## What changes were proposed in this pull request? An RDD is created using LabeledPoint, but the comment is like #LabeledPoint(feature, label). Although in the method ChiSquareTest.test, the second parameter is feature and the third parameter is label, it it better to write label in front of feature here because if an RDD is created using LabeldPoint, what we get are actually (label, feature) pairs. Now it is changed as LabeledPoint(label, feature). The comments in Scala and Java example have the same typos. ## How was this patch tested? tested https://issues.apache.org/jira/browse/SPARK-24688 Author: Weizhe Huang 492816239qq.com Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #21665 from uzmijnlm/my_change. Authored-by: Huangweizhe <huangweizhe@bbdservice.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-25 09:24:20 -05:00
Kazuhiro Sera	8ec25cd67e	Fix typos detected by github.com/client9/misspell ## What changes were proposed in this pull request? Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell). This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know. ## How was this patch tested? ### before ``` $ misspell . \| grep -v '.js' R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition" NOTICE-binary:454:16: "containd" is a misspelling of "contained" R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition" R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition" R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence" R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred" R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output" R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment" common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent" common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent" common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin" core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden" core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments" dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual" dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across" dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across" dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments" docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden" docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes" docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN" docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior" examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract" examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions" python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment" python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability" python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter" python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability" python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter" python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns" python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization" python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary" resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints" resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when" sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp" sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage" sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred" sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing" sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with" sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring" ``` ### after ``` $ misspell . \| grep -v '.js' common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" ``` Closes #22070 from seratch/fix-typo. Authored-by: Kazuhiro Sera <seratch@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-11 21:23:36 -05:00
Li Jin	8141d55926	[SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide ## What changes were proposed in this pull request? Update Pandas UDFs section in sql-programming-guide. Add section for grouped aggregate pandas UDF. ## How was this patch tested? Author: Li Jin <ice.xelloss@gmail.com> Closes #21887 from icexelloss/SPARK-23633-sql-programming-guide.	2018-07-31 10:10:38 +08:00
WeichenXu	59c3c233f4	[SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary ## What changes were proposed in this pull request? Add user guide and scala/java/python examples for `ml.stat.Summarizer` ## How was this patch tested? Doc generated snapshot: ![image](https://user-images.githubusercontent.com/19235986/38987108-45646044-4401-11e8-9ba8-ae94ba96cbf9.png) ![image](https://user-images.githubusercontent.com/19235986/38987096-36dcc73c-4401-11e8-87f9-5b91e7f9e27b.png) ![image](https://user-images.githubusercontent.com/19235986/38987088-2d1c1eaa-4401-11e8-80b5-8c40d529a120.png) ![image](https://user-images.githubusercontent.com/19235986/38987077-22ce8be0-4401-11e8-8199-c3a4d8d23201.png) Author: WeichenXu <weichen.xu@databricks.com> Closes #20446 from WeichenXu123/summ_guide.	2018-07-11 13:56:09 -05:00
cluo	ac78bcce00	[SPARK-24743][EXAMPLES] Update the JavaDirectKafkaWordCount example to support the new API of kafka ## What changes were proposed in this pull request? Add some required configs for Kafka consumer in JavaDirectKafkaWordCount class. ## How was this patch tested? Manual tests on Local mode. Author: cluo <0512lc@163.com> Closes #21717 from cluo512/SPARK-24743-update-JavaDirectKafkaWordCount.	2018-07-05 09:06:25 -05:00
Ilan Filonenko	1a644afbac	[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <if56@cornell.edu> Author: Ilan Filonenko <ifilondz@gmail.com> Closes #21092 from ifilonenko/master.	2018-06-08 11:18:34 -07:00
Shahid	a5d775a1f3	[SPARK-24191][ML] Scala Example code for Power Iteration Clustering ## What changes were proposed in this pull request? Added example code for Power Iteration Clustering in Spark ML examples Author: Shahid <shahidki31@gmail.com> Closes #21248 from shahidki31/sparkCommit.	2018-06-08 08:45:56 -05:00
Shahid	2c100209f0	[SPARK-24224][ML-EXAMPLES] Java example code for Power Iteration Clustering in spark.ml ## What changes were proposed in this pull request? Java example code for Power Iteration Clustering in spark.ml ## How was this patch tested? Locally tested Author: Shahid <shahidki31@gmail.com> Closes #21283 from shahidki31/JavaPicExample.	2018-06-08 08:44:59 -05:00
jerryshao	5fccdae189	[SPARK-22968][DSTREAM] Throw an exception on partition revoking issue ## What changes were proposed in this pull request? Kafka partitions can be revoked when new consumers joined in the consumer group to rebalance the partitions. But current Spark Kafka connector code makes sure there's no partition revoking scenarios, so trying to get latest offset from revoked partitions will throw exceptions as JIRA mentioned. Partition revoking happens when new consumer joined the consumer group, which means different streaming apps are trying to use same group id. This is fundamentally not correct, different apps should use different consumer group. So instead of throwing an confused exception from Kafka, improve the exception message by identifying revoked partition and directly throw an meaningful exception when partition is revoked. Besides, this PR also fixes bugs in `DirectKafkaWordCount`, this example simply cannot be worked without the fix. ``` 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream ``` ## How was this patch tested? This is manually verified in local cluster, unfortunately I'm not sure how to simulate it in UT, so propose the PR without UT added. Author: jerryshao <sshao@hortonworks.com> Closes #21038 from jerryshao/SPARK-22968.	2018-04-17 21:08:42 -05:00
Ilan Filonenko	f15906da15	[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies ## What changes were proposed in this pull request? Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket. ## How was this patch tested? This patch was tested with unit and integration tests. Author: Ilan Filonenko <if56@cornell.edu> Closes #20669 from ifilonenko/remove-init-container.	2018-03-19 11:29:56 -07:00
DylanGuedes	b6f837c9d3	[PYTHON] Changes input variable to not conflict with built-in function Signed-off-by: DylanGuedes <djmgguedesgmail.com> ## What changes were proposed in this pull request? Changes variable name conflict: [input is a built-in python function](https://stackoverflow.com/questions/20670732/is-input-a-keyword-in-python). ## How was this patch tested? I runned the example and it works fine. Author: DylanGuedes <djmgguedes@gmail.com> Closes #20775 from DylanGuedes/input_variable.	2018-03-10 19:48:29 +09:00
Benjamin Peterson	7013eea11c	[SPARK-23522][PYTHON] always use sys.exit over builtin exit The exit() builtin is only for interactive use. applications should use sys.exit(). ## What changes were proposed in this pull request? All usage of the builtin `exit()` function is replaced by `sys.exit()`. ## How was this patch tested? I ran `python/run-tests`. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Benjamin Peterson <benjamin@python.org> Closes #20682 from benjaminp/sys-exit.	2018-03-08 20:38:34 +09:00
Shashwat Anand	4aaa7d40bf	[MINOR][DOC] Use raw triple double quotes around docstrings where there are occurrences of backslashes. From [PEP 257](https://www.python.org/dev/peps/pep-0257/): > For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""". For example, this is what help (kafka_wordcount) shows: ``` DESCRIPTION Counts words in UTF8 encoded, ' ' delimited text received from the network every second. Usage: kafka_wordcount.py <zk> <topic> To run this on your local machine, you need to setup Kafka and create a producer first, see http://kafka.apache.org/documentation.html#quickstart and then run the example `$ bin/spark-submit --jars external/kafka-assembly/target/scala-/spark-streaming-kafka-assembly-.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test` ``` This is what it shows, after the fix: ``` DESCRIPTION Counts words in UTF8 encoded, '\n' delimited text received from the network every second. Usage: kafka_wordcount.py <zk> <topic> To run this on your local machine, you need to setup Kafka and create a producer first, see http://kafka.apache.org/documentation.html#quickstart and then run the example `$ bin/spark-submit --jars \ external/kafka-assembly/target/scala-/spark-streaming-kafka-assembly-.jar \ examples/src/main/python/streaming/kafka_wordcount.py \ localhost:2181 test` ``` The thing worth noticing is no linebreak here in the help. ## What changes were proposed in this pull request? Change triple double quotes to raw triple double quotes when there are occurrences of backslashes in docstrings. ## How was this patch tested? Manually as this is a doc fix. Author: Shashwat Anand <me@shashwat.me> Closes #20497 from ashashwat/docstring-fixes.	2018-02-03 10:31:04 -08:00
gatorsmile	7a2ada223e	[SPARK-23261][PYSPARK] Rename Pandas UDFs ## What changes were proposed in this pull request? Rename the public APIs and names of pandas udfs. - `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF` - `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF` - `PANDAS GROUP AGG UDF` -> `GROUPED AGG PANDAS UDF` ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #20428 from gatorsmile/renamePandasUDFs.	2018-01-30 21:55:55 +09:00
sethah	5056877e8b	[SPARK-23138][ML][DOC] Multiclass logistic regression summary example and user guide ## What changes were proposed in this pull request? User guide and examples are updated to reflect multiclass logistic regression summary which was added in [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139). I did not make a separate summary example, but added the summary code to the multiclass example that already existed. I don't see the need for a separate example for the summary. ## How was this patch tested? Docs and examples only. Ran all examples locally using spark-submit. Author: sethah <shendrickson@cloudera.com> Closes #20332 from sethah/multiclass_summary_example.	2018-01-30 09:02:16 +02:00
Bryan Cutler	0d60b3213f	[SPARK-22221][DOCS] Adding User Documentation for Arrow ## What changes were proposed in this pull request? Adding user facing documentation for working with Arrow in Spark Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: hyukjinkwon <gurwls223@gmail.com> Closes #19575 from BryanCutler/arrow-user-docs-SPARK-2221.	2018-01-29 10:25:25 -08:00
hyukjinkwon	b8c32dc573	[SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in PySpark examples ## What changes were proposed in this pull request? This PR proposes to relocate the docstrings in modules of examples to the top. Seems these are mistakes. So, for example, the below codes ```python >>> help(aft_survival_regression) ``` shows the module docstrings for examples as below: Before ``` Help on module aft_survival_regression: NAME aft_survival_regression ... DESCRIPTION # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # ... (END) ``` After ``` Help on module aft_survival_regression: NAME aft_survival_regression ... DESCRIPTION An example demonstrating aft survival regression. Run with: bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py (END) ``` ## How was this patch tested? Manually checked. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20416 from HyukjinKwon/module-docstring-example.	2018-01-28 10:33:06 +09:00
Bago Amirbekian	05839d1648	[SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples. ## What changes were proposed in this pull request? Added documentation for new transformer. Author: Bago Amirbekian <bago@databricks.com> Closes #20285 from MrBago/sizeHintDocs.	2018-01-23 14:11:23 -08:00
Liang-Chi Hsieh	b74366481c	[SPARK-23048][ML] Add OneHotEncoderEstimator document and examples ## What changes were proposed in this pull request? We have `OneHotEncoderEstimator` now and `OneHotEncoder` will be deprecated since 2.3.0. We should add `OneHotEncoderEstimator` into mllib document. We also need to provide corresponding examples for `OneHotEncoderEstimator` which are used in the document too. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20257 from viirya/SPARK-23048.	2018-01-19 12:48:42 +02:00

1 2 3 4 5 ...

972 commits