ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
s71955	619c949019	[SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for loadtable command. What changes were proposed in this pull request Updated the Migration guide for the behavior changes done in the JIRA issue SPARK-23425. How was this patch tested? Manually verified. Closes #22396 from sujith71955/master_newtest. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-17 19:22:27 +08:00
Dongjoon Hyun	bfcf742605	[SPARK-24418][FOLLOWUP][DOC] Update docs to show Scala 2.11.12 ## What changes were proposed in this pull request? SPARK-24418 upgrades Scala to 2.11.12. This PR updates Scala version in docs. - https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications (screenshot) ![screen1](https://user-images.githubusercontent.com/9700541/45590509-9c5f0400-b8ee-11e8-9293-e48d297db894.png) - https://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs (Scala, Java) (These are hyperlink updates) - https://spark.apache.org/docs/latest/streaming-flume-integration.html#configuring-flume-1 (screenshot) ![screen2](https://user-images.githubusercontent.com/9700541/45590511-a123b800-b8ee-11e8-97a5-b7f2288229c2.png) ## How was this patch tested? Manual. ```bash $ cd docs $ SKIP_API=1 jekyll build ``` Closes #22431 from dongjoon-hyun/SPARK-24418. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-09-16 04:14:19 +00:00
gatorsmile	bb2f069cf2	[SPARK-25436] Bump master branch version to 2.5.0-SNAPSHOT ## What changes were proposed in this pull request? In the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`. ## How was this patch tested? N/A Closes #22426 from gatorsmile/bumpVersionMaster. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-15 16:24:02 -07:00
LucaCanali	45c4ebc817	[SPARK-25170][DOC] Add list and short description of Spark Executor Task Metrics to the documentation. ## What changes were proposed in this pull request? Add description of Executor Task Metrics to the documentation. Closes #22397 from LucaCanali/docMonitoringTaskMetrics. Authored-by: LucaCanali <luca.canali@cern.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-13 10:19:21 -05:00
Michael Mior	ab25c96790	[SPARK-23820][CORE] Enable use of long form of callsite in logs This is a rework of #21433 to address some concerns there. Closes #22398 from michaelmior/long-callsite2. Authored-by: Michael Mior <mmior@uwaterloo.ca> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-13 09:45:25 +08:00
Mario Molina	c9cb393dc4	[SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as null when nullValue is set. ## What changes were proposed in this pull request? In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s. Closes #22234 Closes #22367 ## How was this patch tested? It was tested by `CSVSuite` and new tests added in the PR #22234 Closes #22389 from MaxGekk/csv-empty-value-master. Lead-authored-by: Mario Molina <mmolimar@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-11 20:47:14 +08:00
Ilan Filonenko	1cfda44825	[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S ## What changes were proposed in this pull request? Add spark.executor.pyspark.memory limit for K8S ## How was this patch tested? Unit and Integration tests Closes #22298 from ifilonenko/SPARK-25021. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Holden Karau <holden@pigscanfly.ca>	2018-09-08 22:18:06 -07:00
Yuming Wang	b0ada7dce0	[SPARK-25330][BUILD][BRANCH-2.3] Revert Hadoop 2.7 to 2.7.3 ## What changes were proposed in this pull request? How to reproduce permission issue: ```sh # build spark ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tar && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330 export HADOOP_PROXY_USER=user_a bin/spark-sql export HADOOP_PROXY_USER=user_b bin/spark-sql ``` ```java Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx------ at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) ``` The issue occurred in this commit: `feb886f209`. This pr revert Hadoop 2.7 to 2.7.3 to avoid this issue. ## How was this patch tested? unit tests and manual tests. Closes #22327 from wangyum/SPARK-25330. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-06 21:41:13 -07:00
Rob Vesse	da6fa3828b	[SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be tmpfs backed on K8S ## What changes were proposed in this pull request? The default behaviour of Spark on K8S currently is to create `emptyDir` volumes to back `SPARK_LOCAL_DIRS`. In some environments e.g. diskless compute nodes this may actually hurt performance because these are backed by the Kubelet's node storage which on a diskless node will typically be some remote network storage. Even if this is enterprise grade storage connected via a high speed interconnect the way Spark uses these directories as scratch space (lots of relatively small short lived files) has been observed to cause serious performance degradation. Therefore we would like to provide the option to use K8S's ability to instead back these `emptyDir` volumes with `tmpfs`. Therefore this PR adds a configuration option that enables `SPARK_LOCAL_DIRS` to be backed by Memory backed `emptyDir` volumes rather than the default. Documentation is added to describe both the default behaviour plus this new option and its implications. One of which is that scratch space then counts towards your pods memory limits and therefore users will need to adjust their memory requests accordingly. NB - This is an alternative version of PR #22256 reduced to just the `tmpfs` piece ## How was this patch tested? Ran with this option in our diskless compute environments to verify functionality Author: Rob Vesse <rvesse@dotnetrdf.org> Closes #22323 from rvesse/SPARK-25262-tmpfs.	2018-09-06 16:18:59 -07:00
Yuming Wang	3e033035a3	[SPARK-25258][SPARK-23131][SPARK-25176][BUILD] Upgrade Kryo to 4.0.2 ## What changes were proposed in this pull request? Upgrade chill to 0.9.3, Kryo to 4.0.2, to get bug fixes and improvements. The resolved tickets includes: - SPARK-25258 Upgrade kryo package to version 4.0.2 - SPARK-23131 Kryo raises StackOverflow during serializing GLR model - SPARK-25176 Kryo fails to serialize a parametrised type hierarchy More details: https://github.com/twitter/chill/releases/tag/v0.9.3 `cc3910d501` ## How was this patch tested? Existing tests. Closes #22179 from wangyum/SPARK-23131. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-05 15:48:41 -07:00
Darcy Shen	546683c21a	[SPARK-25298][BUILD] Improve build definition for Scala 2.12 ## What changes were proposed in this pull request? Improve build for Scala 2.12. Current build for sbt fails on the subproject `repl`: ``` [info] Compiling 6 Scala sources to /Users/rendong/wdi/spark/repl/target/scala-2.12/classes... [error] /Users/rendong/wdi/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoopInterpreter.scala:80: overriding lazy value importableSymbolsWithRenames in class ImportHandler of type List[(this.intp.global.Symbol, this.intp.global.Name)]; [error] lazy value importableSymbolsWithRenames needs `override' modifier [error] lazy val importableSymbolsWithRenames: List[(Symbol, Name)] = { [error] ^ [warn] /Users/rendong/wdi/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:53: variable addedClasspath in class ILoop is deprecated (since 2.11.0): use reset, replay or require to update class path [warn] if (addedClasspath != "") { [warn] ^ [warn] /Users/rendong/wdi/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:54: variable addedClasspath in class ILoop is deprecated (since 2.11.0): use reset, replay or require to update class path [warn] settings.classpath append addedClasspath [warn] ^ [warn] two warnings found [error] one error found [error] (repl/compile:compileIncremental) Compilation failed [error] Total time: 93 s, completed 2018-9-3 10:07:26 ``` ## How was this patch tested? ``` ./dev/change-scala-version.sh 2.12 ## For Maven ./build/mvn -Pscala-2.12 [mvn commands] ## For SBT sbt -Dscala.version=2.12.6 ``` Closes #22310 from sadhen/SPARK-25298. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-03 07:36:04 -05:00
huangtengfei02	339859c4e4	[SPARK-25261][MINOR][DOC] update the description for spark.executor\|driver.memory in configuration.md ## What changes were proposed in this pull request? As described in [SPARK-25261](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-25261)，the unit of spark.executor.memory and spark.driver.memory is parsed as bytes in some cases if no unit specified, while in https://spark.apache.org/docs/latest/configuration.html#application-properties, they are descibed as MiB, which may lead to some misunderstandings. ## How was this patch tested? N/A Closes #22252 from ivoson/branch-correct-configuration. Lead-authored-by: huangtengfei02 <huangtengfei02@baidu.com> Co-authored-by: Huang Tengfei <tengfei.h@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-31 09:06:38 -05:00
Reza Safi	135ff16a35	[SPARK-25233][STREAMING] Give the user the option of specifying a minimum message per partition per batch when using kafka direct API with backpressure After SPARK-18371, it is guaranteed that there would be at least one message per partition per batch using direct kafka API when new messages exist in the topics. This change will give the user the option of setting the minimum instead of just a hard coded 1 limit The related unit test is updated and some internal tests verified that the topic partitions with new messages will be progressed by the specified minimum. Author: Reza Safi <rezasafi@cloudera.com> Closes #22223 from rezasafi/streaminglag.	2018-08-30 13:26:03 -05:00
Maxim Gekk	3c67cb0b52	[SPARK-25273][DOC] How to install testthat 1.0.2 ## What changes were proposed in this pull request? R tests require `testthat` v1.0.2. In the PR, I described how to install the version in the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests. Closes #22272 from MaxGekk/r-testthat-doc. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-30 20:25:26 +08:00
Ryan Blue	7ad18ee9f2	[SPARK-25004][CORE] Add spark.executor.pyspark.memory limit. ## What changes were proposed in this pull request? This adds `spark.executor.pyspark.memory` to configure Python's address space limit, [`resource.RLIMIT_AS`](https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS). Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory: ``` File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer fe_eval_rec.update(f(src_rec_prep, mat_rec_prep)) File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, [])) File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare permutations = sorted(permutations, reverse=True) MemoryError ``` The new pyspark memory setting is used to increase requested YARN container memory, instead of sharing overhead memory between python and off-heap JVM activity. ## How was this patch tested? Tested memory limits in our YARN cluster and verified that MemoryError is thrown. Author: Ryan Blue <blue@apache.org> Closes #21977 from rdblue/SPARK-25004-add-python-memory-limit.	2018-08-28 12:31:33 -07:00
Yinan Li	dac099d082	[SPARK-24090][K8S] Update running-on-kubernetes.md ## What changes were proposed in this pull request? Updated documentation for Spark on Kubernetes for the upcoming 2.4.0. Please review http://spark.apache.org/contributing.html before opening a pull request. mccheah erikerlandson Closes #22224 from liyinan926/master. Authored-by: Yinan Li <ynli@google.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-27 15:55:34 -05:00
jaroslav chládek	8e6427871a	Correct missing punctuation in the documentation ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22189 from movrsprbp/patch-1. Authored-by: jaroslav chládek <mastermism@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-25 12:49:48 +08:00
Gengliang Wang	05974f9431	[SPARK-25133][SQL][DOC] Avro data source guide ## What changes were proposed in this pull request? Create documentation for AVRO data source. The new page will be linked in https://spark.apache.org/docs/latest/sql-programming-guide.html For preview please unzip the following file: [AvroDoc.zip](https://github.com/apache/spark/files/2313011/AvroDoc.zip) Closes #22121 from gengliangwang/avroDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-23 13:45:49 +08:00
Sean Owen	35f7f5ce83	[DOCS][MINOR] Fix a few broken links and typos, and, nit, use HTTPS more consistently ## What changes were proposed in this pull request? Fix a few broken links and typos, and, nit, use HTTPS more consistently esp. on scripts and Apache links ## How was this patch tested? Doc build Closes #22172 from srowen/DocTypo. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-22 01:02:17 +08:00
Zhang Le	219ed7b487	[DOCS] Fixed NDCG formula issues When j is 0, log(j+1) will be 0, and this leads to division by 0 issue. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22090 from yueguoguo/patch-1. Authored-by: Zhang Le <yueguoguo@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-20 14:59:03 -05:00
Liang-Chi Hsieh	8b0e94d896	[SPARK-23042][ML] Use OneHotEncoderModel to encode labels in MultilayerPerceptronClassifier ## What changes were proposed in this pull request? In MultilayerPerceptronClassifier, we use RDD operation to encode labels for now. I think we should use ML's OneHotEncoderEstimator/Model to do the encoding. ## How was this patch tested? Existing tests. Closes #20232 from viirya/SPARK-23042. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-08-17 18:40:29 +00:00
Joey Krabacher	30be71e912	[DOCS] Fix cloud-integration.md Typo Corrected typo; changed spark-default.conf to spark-defaults.conf Closes #22125 from KraFusion/patch-2. Authored-by: Joey Krabacher <jkrabacher@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-16 16:48:51 -07:00
Joey Krabacher	709f541dd0	[DOCS] Update configuration.md changed $SPARK_HOME/conf/spark-default.conf to $SPARK_HOME/conf/spark-defaults.conf no testing necessary as this was a change to documentation. Closes #22116 from KraFusion/patch-1. Authored-by: Joey Krabacher <jkrabacher@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-16 16:47:52 -07:00
Ilan Filonenko	a791c29bd8	[SPARK-23984][K8S] Changed Python Version config to be camelCase ## What changes were proposed in this pull request? Small formatting change to have Python Version be camelCase as per request during PR review. ## How was this patch tested? Tested with unit and integration tests Author: Ilan Filonenko <if56@cornell.edu> Closes #22095 from ifilonenko/spark-py-edits.	2018-08-15 17:52:12 -07:00
Imran Rashid	1024875843	[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults. ## What changes were proposed in this pull request? (a) disabled rest submission server by default in standalone mode (b) fails the standalone master if rest server enabled & authentication secret set (c) fails the mesos cluster dispatcher if authentication secret set (d) doc updates (e) when submitting a standalone app, only try the rest submission first if spark.master.rest.enabled=true otherwise you'd see a 10 second pause like 18/08/09 08:13:22 INFO RestSubmissionClient: Submitting a request to launch an application in spark://... 18/08/09 08:13:33 WARN RestSubmissionClient: Unable to connect to server spark://... I also made sure the mesos cluster dispatcher failed with the secret enabled, though I had to do that on slightly different code as I don't have mesos native libs around. ## How was this patch tested? I ran the tests in the mesos module & in core for org.apache.spark.deploy.* I ran a test on a cluster with standalone master to make sure I could still start with the right configs, and would fail the right way too. Closes #22071 from squito/rest_doc_updates. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-14 13:02:33 -05:00
Kazuhiro Sera	8ec25cd67e	Fix typos detected by github.com/client9/misspell ## What changes were proposed in this pull request? Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell). This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know. ## How was this patch tested? ### before ``` $ misspell . \| grep -v '.js' R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition" NOTICE-binary:454:16: "containd" is a misspelling of "contained" R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition" R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition" R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence" R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred" R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output" R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment" common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent" common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent" common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin" core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden" core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments" dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual" dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across" dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across" dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments" docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden" docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes" docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN" docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior" examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract" examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions" python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment" python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability" python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter" python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability" python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter" python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns" python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization" python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary" resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints" resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when" sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp" sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage" sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred" sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing" sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with" sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring" ``` ### after ``` $ misspell . \| grep -v '.js' common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" ``` Closes #22070 from seratch/fix-typo. Authored-by: Kazuhiro Sera <seratch@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-11 21:23:36 -05:00
Achuth17	d36539741f	[SPARK-24626][SQL] Improve location size calculation in Analyze Table command ## What changes were proposed in this pull request? Currently, Analyze table calculates table size sequentially for each partition. We can parallelize size calculations over partitions. Results : Tested on a table with 100 partitions and data stored in S3. With changes : - 10.429s - 10.557s - 10.439s - 9.893s  Without changes : - 110.034s - 99.510s - 100.743s - 99.106s ## How was this patch tested? Simple unit test. Closes #21608 from Achuth17/improveAnalyze. Lead-authored-by: Achuth17 <Achuth.narayan@gmail.com> Co-authored-by: arajagopal17 <arajagopal@qubole.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-09 08:29:24 -07:00
Sean Owen	5f9633dc97	[SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7 ## What changes were proposed in this pull request? Update Hadoop 2.7 to 2.7.7 to pull in bug and security fixes. ## How was this patch tested? Existing tests. Author: Sean Owen <srowen@gmail.com> Closes #21987 from srowen/SPARK-25015.	2018-08-04 14:59:13 -05:00
Dilip Biswal	73dd6cf9b5	[SPARK-24966][SQL] Implement precedence rules for set operations. ## What changes were proposed in this pull request? Currently the set operations INTERSECT, UNION and EXCEPT are assigned the same precedence. This PR fixes the problem by giving INTERSECT higher precedence than UNION and EXCEPT. UNION and EXCEPT operators are evaluated in the order in which they appear in the query from left to right. This results in change in behavior because of the change in order of evaluations of set operators in a query. The old behavior is still preserved under a newly added config parameter. Query `:` ``` SELECT * FROM t1 UNION SELECT * FROM t2 EXCEPT SELECT * FROM t3 INTERSECT SELECT * FROM t4 ``` Parsed plan before the change `:` ``` == Parsed Logical Plan == 'Intersect false :- 'Except false : :- 'Distinct : : +- 'Union : : :- 'Project [] : : : +- 'UnresolvedRelation `t1` : : +- 'Project [] : : +- 'UnresolvedRelation `t2` : +- 'Project [] : +- 'UnresolvedRelation `t3` +- 'Project [] +- 'UnresolvedRelation `t4` ``` Parsed plan after the change `:` ``` == Parsed Logical Plan == 'Except false :- 'Distinct : +- 'Union : :- 'Project [] : : +- 'UnresolvedRelation `t1` : +- 'Project [] : +- 'UnresolvedRelation `t2` +- 'Intersect false :- 'Project [] : +- 'UnresolvedRelation `t3` +- 'Project [] +- 'UnresolvedRelation `t4` ``` ## How was this patch tested? Added tests in PlanParserSuite, SQLQueryTestSuite. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21941 from dilipbiswal/SPARK-24966.	2018-08-02 22:04:17 -07:00
Marco Gaido	ad2e636628	[SPARK-24598][DOCS] State in the documentation the behavior when arithmetic operations cause overflow ## What changes were proposed in this pull request? According to the discussion in https://github.com/apache/spark/pull/21599, changing the behavior of arithmetic operations so that they can check for overflow is not nice in a minor release. What we can do for 2.4 is warn users about the current behavior in the documentation, so that they are aware of the issue and can take proper actions. ## How was this patch tested? NA Author: Marco Gaido <marcogaido91@gmail.com> Closes #21967 from mgaido91/SPARK-24598_doc.	2018-08-02 09:28:13 -07:00
Sean Owen	c9914cf049	[MINOR][DOCS] Add note about Spark network security ## What changes were proposed in this pull request? In response to a recent question, this reiterates that network access to a Spark cluster should be disabled by default, and that access to its hosts and services from outside a private network should be added back explicitly. Also, some minor touch-ups while I was at it. ## How was this patch tested? N/A Author: Sean Owen <srowen@gmail.com> Closes #21947 from srowen/SecurityNote.	2018-08-02 10:22:52 +08:00
liuxian	c5fe412928	[SPARK-18188][DOC][FOLLOW-UP] Add `spark.broadcast.checksum` to configuration ## What changes were proposed in this pull request? This pr add `spark.broadcast.checksum` to configuration. ## How was this patch tested? manually tested Author: liuxian <liu.xian3@zte.com.cn> Closes #21825 from 10110346/checksum_config.	2018-08-01 21:19:24 -05:00
Li Jin	8141d55926	[SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide ## What changes were proposed in this pull request? Update Pandas UDFs section in sql-programming-guide. Add section for grouped aggregate pandas UDF. ## How was this patch tested? Author: Li Jin <ice.xelloss@gmail.com> Closes #21887 from icexelloss/SPARK-23633-sql-programming-guide.	2018-07-31 10:10:38 +08:00
Takeshi Yamamuro	47d84e4d0e	[SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column ## What changes were proposed in this pull request? This pr supported Date/Timestamp in a JDBC partition column (a numeric column is only supported in the master). This pr also modified code to verify a partition column type; ``` val jdbcTable = spark.read .option("partitionColumn", "text") .option("lowerBound", "aaa") .option("upperBound", "zzz") .option("numPartitions", 2) .jdbc("jdbc:postgresql:postgres", "t", options) // with this pr org.apache.spark.sql.AnalysisException: Partition column type should be numeric, date, or timestamp, but string found.; at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.verifyAndGetNormalizedPartitionColumn(JDBCRelation.scala:165) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:85) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:317) // without this pr java.lang.NumberFormatException: For input string: "aaa" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277) ``` Closes #19999 ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21834 from maropu/SPARK-22814.	2018-07-30 07:42:00 -07:00
maryannxue	21fcac1645	[SPARK-24288][SQL] Add a JDBC Option to enable preventing predicate pushdown ## What changes were proposed in this pull request? Add a JDBC Option "pushDownPredicate" (default `true`) to allow/disallow predicate push-down in JDBC data source. ## How was this patch tested? Add a test in `JDBCSuite` Author: maryannxue <maryannxue@apache.org> Closes #21875 from maryannxue/spark-24288.	2018-07-26 23:47:32 -07:00
mcheah	571a6f0574	[SPARK-23146][K8S] Support client mode. ## What changes were proposed in this pull request? Support client mode for the Kubernetes scheduler. Client mode works more or less identically to cluster mode. However, in client mode, the Spark Context needs to be manually bootstrapped with certain properties which would have otherwise been set up by spark-submit in cluster mode. Specifically: - If the user doesn't provide a driver pod name, we don't add an owner reference. This is for usage when the driver is not running in a pod in the cluster. In such a case, the driver can only provide a best effort to clean up the executors when the driver exits, but cleaning up the resources is not guaranteed. The executor JVMs should exit if the driver JVM exits, but the pods will still remain in the cluster in a COMPLETED or FAILED state. - The user must provide a host (spark.driver.host) and port (spark.driver.port) that the executors can connect to. When using spark-submit in cluster mode, spark-submit generates the headless service automatically; in client mode, the user is responsible for setting up their own connectivity. We also change the authentication configuration prefixes for client mode. ## How was this patch tested? Adding an integration test to exercise client mode support. Author: mcheah <mcheah@palantir.com> Closes #21748 from mccheah/k8s-client-mode.	2018-07-25 11:08:41 -07:00
Imran Rashid	15fff79032	[SPARK-24297][CORE] Fetch-to-disk by default for > 2gb Fetch-to-mem is guaranteed to fail if the message is bigger than 2 GB, so we might as well use fetch-to-disk in that case. The message includes some metadata in addition to the block data itself (in particular UploadBlock has a lot of metadata), so we leave a little room. Author: Imran Rashid <irashid@cloudera.com> Closes #21474 from squito/SPARK-24297.	2018-07-25 09:08:42 +08:00
s71955	d4a277f0ce	[SPARK-24812][SQL] Last Access Time in the table description is not valid ## What changes were proposed in this pull request? Last Access Time will always displayed wrong date Thu Jan 01 05:30:00 IST 1970 when user run DESC FORMATTED table command In hive its displayed as "UNKNOWN" which makes more sense than displaying wrong date. seems to be a limitation as of now even from hive, better we can follow the hive behavior unless the limitation has been resolved from hive. spark client output ![spark_desc table](https://user-images.githubusercontent.com/12999161/42753448-ddeea66a-88a5-11e8-94aa-ef8d017f94c5.png) Hive client output ![hive_behaviour](https://user-images.githubusercontent.com/12999161/42753489-f4fd366e-88a5-11e8-83b0-0f3a53ce83dd.png) ## How was this patch tested? UT has been added which makes sure that the wrong date "Thu Jan 01 05:30:00 IST 1970 " shall not be added as value for the Last Access property Author: s71955 <sujithchacko.2010@gmail.com> Closes #21775 from sujith71955/master_hive.	2018-07-24 11:31:27 -07:00
“attilapiros”	d2436a8529	[SPARK-24594][YARN] Introducing metrics for YARN ## What changes were proposed in this pull request? In this PR metrics are introduced for YARN. As up to now there was no metrics in the YARN module a new metric system is created with the name "applicationMaster". To support both client and cluster mode the metric system lifecycle is bound to the AM. ## How was this patch tested? Both client and cluster mode was tested manually. Before the test on one of the YARN node spark-core was removed to cause the allocation failure. Spark was started as (in case of client mode): ``` spark2-submit \ --class org.apache.spark.examples.SparkPi \ --conf "spark.yarn.blacklist.executor.launch.blacklisting.enabled=true" --conf "spark.blacklist.application.maxFailedExecutorsPerNode=2" --conf "spark.dynamicAllocation.enabled=true" --conf "spark.metrics.conf.*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink" \ --master yarn \ --deploy-mode client \ original-spark-examples_2.11-2.4.0-SNAPSHOT.jar \ 1000 ``` In both cases the YARN logs contained the new metrics as: ``` $ yarn logs --applicationId application_1529926424933_0015 ... -- Gauges ---------------------------------------------------------------------- application_1531751594108_0046.applicationMaster.numContainersPendingAllocate value = 0 application_1531751594108_0046.applicationMaster.numExecutorsFailed value = 3 application_1531751594108_0046.applicationMaster.numExecutorsRunning value = 9 application_1531751594108_0046.applicationMaster.numLocalityAwareTasks value = 0 application_1531751594108_0046.applicationMaster.numReleasedContainers value = 0 ... ``` Author: “attilapiros” <piros.attila.zsolt@gmail.com> Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Closes #21635 from attilapiros/SPARK-24594.	2018-07-24 09:33:10 +08:00
zhengruifeng3	81af88687f	[SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide) ## What changes were proposed in this pull request? add doc for string indexer ordering ## How was this patch tested? existing tests Author: zhengruifeng3 <zhengruifeng@jd.com> Author: zhengruifeng <ruifengz@foxmail.com> Closes #21792 from zhengruifeng/doc_string_indexer_ordering.	2018-07-21 08:26:45 -05:00
Daniel van der Ende	2333a34d39	[SPARK-22880][SQL] Add cascadeTruncate option to JDBC datasource This commit adds the `cascadeTruncate` option to the JDBC datasource API, for databases that support this functionality (PostgreSQL and Oracle at the moment). This allows for applying a cascading truncate that affects tables that have foreign key constraints on the table being truncated. ## What changes were proposed in this pull request? Add `cascadeTruncate` option to JDBC datasource API. Allow this to affect the `TRUNCATE` query for databases that support this option. ## How was this patch tested? Existing tests for `truncateQuery` were updated. Also, an additional test was added to ensure that the correct syntax was applied, and that enabling the config for databases that do not support this option does not result in invalid queries. Author: Daniel van der Ende <daniel.vanderende@gmail.com> Closes #20057 from danielvdende/SPARK-22880.	2018-07-20 13:03:57 -07:00
Huangweizhe	ebe9e28488	[SPARK-24628][DOC] Typos of the example code in docs/mllib-data-types.md ## What changes were proposed in this pull request? The example wants to create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), but the list is given as [1, 2, 3, 4, 5, 6]. Now it is changed as [1, 3, 5, 2, 4, 6]. And the example wants to create an RDD of coordinate entries like: entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]). However, it is done with the MatrixEntry class like: entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)]), where the third MatrixEntry has a different row index. Now it is changed as MatrixEntry(2, 1, 3.7). ## How was this patch tested? This is trivial enough that it should not affect tests. Author: Weizhe Huang <huangweizhebbdservice.com> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Huangweizhe <huangweizhe@bbdservice.com> Closes #21612 from huangweizhe123/my_change.	2018-07-18 09:45:56 -05:00
Miklos C	f876d3fa80	[SPARK-20220][DOCS] Documentation Add thrift scheduling pool config to scheduling docs ## What changes were proposed in this pull request? The thrift scheduling pool configuration was removed from a previous release. Adding this back to the job scheduling configuration docs. This PR takes over #17536 and handle some comments here. ## How was this patch tested? Manually. Closes #17536 Author: hyukjinkwon <gurwls223@apache.org> Closes #21778 from HyukjinKwon/SPARK-20220.	2018-07-17 09:22:16 +08:00
Zoltan C. Toth	5d62a985dc	Doc fix: The Imputer is an Estimator Fixing the doc as the imputer is not a `Transformer` but an `Estimator`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala#L96-L97 ## What changes were proposed in this pull request? Simple documentation fix ## How was this patch tested? manual testing Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Zoltan C. Toth <zoltanctoth@gmail.com> Closes #21755 from zoltanctoth/doc-imputer-is-estimator.	2018-07-15 17:08:26 -05:00
WeichenXu	59c3c233f4	[SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary ## What changes were proposed in this pull request? Add user guide and scala/java/python examples for `ml.stat.Summarizer` ## How was this patch tested? Doc generated snapshot: ![image](https://user-images.githubusercontent.com/19235986/38987108-45646044-4401-11e8-9ba8-ae94ba96cbf9.png) ![image](https://user-images.githubusercontent.com/19235986/38987096-36dcc73c-4401-11e8-87f9-5b91e7f9e27b.png) ![image](https://user-images.githubusercontent.com/19235986/38987088-2d1c1eaa-4401-11e8-80b5-8c40d529a120.png) ![image](https://user-images.githubusercontent.com/19235986/38987077-22ce8be0-4401-11e8-8199-c3a4d8d23201.png) Author: WeichenXu <weichen.xu@databricks.com> Closes #20446 from WeichenXu123/summ_guide.	2018-07-11 13:56:09 -05:00
Andrew Korzhuev	5ff1b9ba19	[SPARK-23529][K8S] Support mounting volumes This PR continues #21095 and intersects with #21238. I've added volume mounts as a separate step and added PersistantVolumeClaim support. There is a fundamental problem with how we pass the options through spark conf to fabric8. For each volume type and all possible volume options we would have to implement some custom code to map config values to fabric8 calls. This will result in big body of code we would have to support and means that Spark will always be somehow out of sync with k8s. I think there needs to be a discussion on how to proceed correctly (eg use PodPreset instead) ---- Due to the complications of provisioning and managing actual resources this PR addresses only volume mounting of already present resources. ---- - [x] emptyDir support - [x] Testing - [x] Documentation - [x] KubernetesVolumeUtils tests Author: Andrew Korzhuev <andrew.korzhuev@klarna.com> Author: madanadit <adit@alluxio.com> Closes #21260 from andrusha/k8s-vol.	2018-07-10 22:53:44 -07:00
jerryshao	e2c7e09f74	[SPARK-24646][CORE] Minor change to spark.yarn.dist.forceDownloadSchemes to support wildcard '' ## What changes were proposed in this pull request? In the case of getting tokens via customized `ServiceCredentialProvider`, it is required that `ServiceCredentialProvider` be available in local spark-submit process classpath. In this case, all the configured remote sources should be forced to download to local. For the ease of using this configuration, here propose to add wildcard '' support to `spark.yarn.dist.forceDownloadSchemes`, also clarify the usage of this configuration. ## How was this patch tested? New UT added. Author: jerryshao <sshao@hortonworks.com> Closes #21633 from jerryshao/SPARK-21917-followup.	2018-07-09 10:21:40 +08:00
Gengliang Wang	33952cfa81	[SPARK-24675][SQL] Rename table: validate existence of new location ## What changes were proposed in this pull request? If table is renamed to a existing new location, data won't show up. ``` scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t") scala> sql("select * from t").show() +-----+ \| a\| +-----+ \|hello\| +-----+ scala> sql("alter table t rename to test") res2: org.apache.spark.sql.DataFrame = [] scala> sql("select * from test").show() +---+ \| a\| +---+ +---+ ``` The file layout is like ``` $ tree test test ├── gabage └── t ├── _SUCCESS └── part-00000-856b0f10-08f1-42d6-9eb3-7719261f3d5e-c000.snappy.parquet ``` In Hive, if the new location exists, the renaming will fail even the location is empty. We should have the same validation in Catalog, in case of unexpected bugs. ## How was this patch tested? New unit test. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21655 from gengliangwang/validate_rename_table.	2018-07-05 09:25:19 -07:00
Rekha Joshi	f599cde695	[SPARK-24507][DOCUMENTATION] Update streaming guide ## What changes were proposed in this pull request? Updated streaming guide for direct stream and link to integration guide. ## How was this patch tested? jekyll build Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #21683 from rekhajoshm/SPARK-24507.	2018-07-02 22:39:00 +08:00
Yuanjian Li	6a0b77a55d	[SPARK-24215][PYSPARK][FOLLOW UP] Implement eager evaluation for DataFrame APIs in PySpark ## What changes were proposed in this pull request? Address comments in #21370 and add more test. ## How was this patch tested? Enhance test in pyspark/sql/test.py and DataFrameSuite Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21553 from xuanyuanking/SPARK-24215-follow.	2018-06-27 10:43:06 -07:00

1 2 3 4 5 ...

2162 commits