ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
gengjiaan	dba4375359	[MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs at http://spark.apache.org/docs/latest/configuration.html contain the following description: spark.ui.showConsoleProgress \| true \| Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. -- \| -- \| -- But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So the documented default (true) is inconsistent with the default in the code (false), which may confuse readers. ## How was this patch tested? No unit test needed. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-31 12:17:44 +09:00
Marcelo Vanzin	2ddeff97d7	[SPARK-27963][CORE] Allow dynamic allocation without a shuffle service. This change adds a new option that enables dynamic allocation without the need for a shuffle service. This mode works by tracking which stages generate shuffle files, and keeping executors that generate data for those shuffles alive while the jobs that use them are active. A separate timeout is also added for shuffle data; so that executors that hold shuffle data can use a separate timeout before being removed because of being idle. This allows the shuffle data to be kept around in case it is needed by some new job, or allow users to be more aggressive in timing out executors that don't have shuffle data in active use. The code also hooks up to the context cleaner so that shuffles that are garbage collected are detected, and the respective executors not held unnecessarily. Testing done with added unit tests, and also with TPC-DS workloads on YARN without a shuffle service. Closes #24817 from vanzin/SPARK-27963. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 16:37:38 -07:00
Gabor Somogyi	f83000597f	[SPARK-23472][CORE] Add defaultJavaOptions for driver and executor. ## What changes were proposed in this pull request? This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options). ## How was this patch tested? Existing + additional unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24804 from gaborgsomogyi/SPARK-23472. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:37:26 -07:00
Dongjoon Hyun	47f54b1ec7	[SPARK-28118][CORE] Add `spark.eventLog.compression.codec` configuration ## What changes were proposed in this pull request? Event logs are different from the other data in terms of the lifetime. It would be great to have a new configuration for Spark event log compression like `spark.eventLog.compression.codec` . This PR adds this new configuration as an optional configuration. So, if `spark.eventLog.compression.codec` is not given, `spark.io.compression.codec` will be used. ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24921 from dongjoon-hyun/SPARK-28118. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-06-21 00:43:38 +00:00
Mellacheruvu Sandeep	b7b4452553	[SPARK-24898][DOC] Adding spark.checkpoint.compress to the docs ## What changes were proposed in this pull request? Adding spark.checkpoint.compress configuration parameter to the documentation ![](https://user-images.githubusercontent.com/3538013/59580409-a7013080-90ee-11e9-9b2c-3d29015f597e.png) ## How was this patch tested? Checked the Jekyll-generated HTML docs locally. Also validated the HTML for any issues. Closes #24883 from sandeepvja/SPARK-24898. Authored-by: Mellacheruvu Sandeep <mellacheruvu.sandeep@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-16 22:54:08 -07:00
Thomas Graves	d30284b5a5	[SPARK-27760][CORE] Spark resources - change user resource config from .count to .amount ## What changes were proposed in this pull request? Change the resource config spark.{executor/driver}.resource.{resourceName}.count to .amount to allow future usage of containing both a count and a unit. Right now we only support counts - # of gpus for instance, but in the future we may want to support units for things like memory - 25G. I think making the user only have to specify a single config .amount is better than making them specify 2 separate configs of a .count and then a .unit. Change it now since it's a user-facing config. Amount also matches how the Spark on YARN configs are set up. ## How was this patch tested? Unit tests and manually verified on yarn and local cluster mode Closes #24810 from tgravescs/SPARK-27760-amount. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-06-06 14:16:05 -05:00
gengjiaan	8feb80ad86	[SPARK-27811][CORE][DOCS] Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. ## What changes were proposed in this pull request? The docs for `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead` are slightly ambiguous. For example, the original docs for `spark.driver.memoryOverhead` start with `The amount of off-heap memory to be allocated per driver in cluster mode`. But `MemoryManager` also manages a memory area named off-heap, which is used to allocate memory in Tungsten mode. So the description of `spark.driver.memoryOverhead` is confusing. `spark.executor.memoryOverhead` has the same problem. ## How was this patch tested? Existing UT. Closes #24671 from beliefer/improve-docs-of-overhead. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-01 08:19:50 -05:00
Thomas Graves	1277f8fa92	[SPARK-27362][K8S] Resource Scheduling support for k8s ## What changes were proposed in this pull request? Add the ability to map the spark resource configs spark.{executor/driver}.resource.{resourceName} to the kubernetes Container builder so that we request resources (gpus/fpgas/etc) from kubernetes. Note that the spark configs will overwrite any resource configs users put into a pod template. I added a generic vendor config which is only used by kubernetes right now. I intentionally didn't put it into the kubernetes config namespace just to avoid adding more config prefixes. I will add more documentation for this under jira SPARK-27492. I think it will be easier to do it all at once to get a cohesive story. ## How was this patch tested? Unit tests and manually testing on k8s cluster. Closes #24703 from tgravescs/SPARK-27362. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-31 15:26:14 -05:00
Marcelo Vanzin	09ed64d795	[SPARK-27868][CORE] Better default value and documentation for socket server backlog. First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors. Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies. So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem. Closes #24732 from vanzin/SPARK-27868. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-29 14:56:36 -07:00
Thomas Graves	74e5e41eeb	[SPARK-27488][CORE] Driver interface to support GPU resources ## What changes were proposed in this pull request? Added the driver functionality to get the resources. The user interface is: SparkContext.resources - I called it this to match the TaskContext.resources api proposed in the other PR. Originally it was going to be called SparkContext.getResources but changed to be consistent, if people have strong feelings I can change it. There are 2 ways the driver can discover what resources it has. 1) user specifies a discoveryScript, this is similar to the executors and is meant for yarn and k8s where they don't tell you what you were allocated but you are running in isolated environment. 2) read the config spark.driver.resource.resourceName.addresses. The config is meant to be used with standalone mode where the Worker will have to assign what GPU addresses the Driver is allowed to use by setting that config. When the user runs a spark application, if they want the driver to have GPU's they would specify the conf spark.driver.resource.gpu.count=X where x is the number they want. If they are running on yarn or k8s they will also have to specify the discoveryScript as specified above, if they are on standalone mode and cluster is setup properly they wouldn't have to specify anything else. We could potentially get rid of the spark.driver.resources.gpu.addresses config which is really meant to be an internal config for worker to set if the standalone mode Worker wanted to write a discoveryScript out and set that for the user. I'll wait for the jira that implements that to decide if we can remove. - This PR also has changes to be consistent about using resourceName everywhere. - change the config names from POSTFIX to SUFFIX to be more consistent with other areas in Spark - Moved the config checks around a bit since now used by both executor and driver. 
Note those might overlap a bit with https://github.com/apache/spark/pull/24374 so we will have to figure out which one should go in first. ## How was this patch tested? Unit tests and manually test the interface. Closes #24615 from tgravescs/SPARK-27488. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-05-23 11:46:13 -07:00
Thomas Graves	db2e3c4341	[SPARK-27024] Executor interface for cluster managers to support GPU and other resources ## What changes were proposed in this pull request? Add in GPU and generic resource type allocation to the executors. Note this is part of a bigger feature for gpu-aware scheduling and is just how the executor finds the resources. The general flow: - users ask for a certain set of resources, for instance number of gpus - each cluster manager has a specific way to do this. - cluster manager allocates a container or set of resources (standalone mode) - When spark launches the executor in that container, the executor either has to be told what resources it has or it has to auto discover them. - Executor has to register with Driver and tell the driver the set of resources it has so the scheduler can use that to schedule tasks that require a certain amount of each of those resources In this pr I added configs and arguments to the executor to be able to discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation so that it can assign specific resources to specific executors in case there are multiple executors on a node. The argument is a file that contains a JSON Array of ResourceInformation objects. The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use. Note that there will be follow on PRs to add other parts like the scheduler part. See the epic high level jira: https://issues.apache.org/jira/browse/SPARK-24615 ## How was this patch tested? Added unit tests and manually tested. Closes #24406 from tgravescs/gpu-sched-executor-clean. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-14 08:41:41 -05:00
jiafu.zhang@intel.com	fa5dc0a45a	[SPARK-26632][CORE] Separate Thread Configurations of Driver and Executor ## What changes were proposed in this pull request? For the below three thread configuration items applied to both driver and executor, spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.netty.dispatcher.numThreads, we separate them to driver specifics and executor specifics. spark.driver.rpc.io.serverThreads < - > spark.executor.rpc.io.serverThreads spark.driver.rpc.io.clientThreads < - > spark.executor.rpc.io.clientThreads spark.driver.rpc.netty.dispatcher.numThreads < - > spark.executor.rpc.netty.dispatcher.numThreads Spark reads these specifics first and fall back to the common configurations. ## How was this patch tested? We ran the SimpleMap app without shuffle for benchmark purpose to test Spark's scalability in HPC with omini-path NIC which has higher bandwidth than normal ethernet NIC. Spark's base version is 2.4.0. Spark ran in the Standalone mode. Driver was in a standalone node. After the separation, the performance is improved a lot in 256 nodes and 512 nodes. see below test results of SimpleMapTask before and after the enhancement. You can view the tables in the [JIRA](https://issues.apache.org/jira/browse/SPARK-26632) too. 
ds: spark.driver.rpc.io.serverThreads dc: spark.driver.rpc.io.clientThreads dd: spark.driver.rpc.netty.dispatcher.numThreads ed: spark.executor.rpc.netty.dispatcher.numThreads time: Overall Time (s) old time: Overall Time without Separation (s) Before: nodes \| ds \| dc \| dd \| ed \| time -- \|-- \| -- \| -- \| -- \| -- 128 nodes \| 8 \| 8 \| 8 \| 8 \| 108 256 nodes \| 8 \| 8 \| 8 \| 8 \| 196 512 nodes \| 8 \| 8 \| 8 \| 8 \| 377 After: nodes \| ds \| dc \| dd \| ed \| time \| improvement -- \| -- \| -- \| -- \| -- \| -- \| -- 128 nodes \| 15 \| 15 \| 10 \| 30 \| 107 \| 0.9% 256 nodes \| 12 \| 15 \| 10 \| 30 \| 159 \| 18.8% 512 nodes \| 12 \| 15 \| 10 \| 30 \| 283 \| 24.9% Closes #23560 from zjf2012/thread_conf_separation. Authored-by: jiafu.zhang@intel.com <jiafu.zhang@intel.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-10 10:42:43 -07:00
shivusondur	4cb1cd6ab7	[SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" ## What changes were proposed in this pull request? Corrected the default value in the Documentation for "spark.redaction.regex" ## How was this patch tested? NA Closes #24428 from shivusondur/doc2. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-21 16:56:12 +09:00
Shixiong Zhu	5ff39cd5ee	[SPARK-27394][WEBUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate ## What changes were proposed in this pull request? This PR updates `AppStatusListener` to flush `LiveEntity` if necessary when receiving `SparkListenerExecutorMetricsUpdate`. This will ensure the staleness of Spark UI doesn't last more than the executor heartbeat interval. ## How was this patch tested? The new unit test. Closes #24303 from zsxwing/SPARK-27394. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-09 08:26:00 -07:00
Jungtaek Lim (HeartSaVioR)	a840b99daf	[MINOR][DOC] Fix html tag broken in configuration.md ## What changes were proposed in this pull request? This patch fixes wrong HTML tag in configuration.md which breaks the table tag. This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E ## How was this patch tested? This change is one-liner and pretty obvious so I guess we may be able to skip testing. Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 08:41:19 -07:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
“attilapiros”	2fbed378bf	[MINOR][DOC] Add missing space after comma Adding missing spaces after commas. Closes #24205 from attilapiros/minor-doc-changes. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 15:22:07 -05:00
Jungtaek Lim (HeartSaVioR)	d5bda2c9e8	[SPARK-26792][CORE] Apply custom log URL to Spark UI ## What changes were proposed in this pull request? [SPARK-23155](https://issues.apache.org/jira/browse/SPARK-23155) enables SHS to set up custom executor log URLs. This patch proposes to extend this feature to the Spark UI as well. Unlike the approach we took for SHS (replace executor log URLs when executor information is requested, so it's like a change of view), this patch replaces executor log URLs while registering the executor, which also affects the event log. From the SHS's point of view, a custom log URL applied to the Spark UI is treated as the original log URL. ## How was this patch tested? Added UT. Closes #23790 from HeartSaVioR/SPARK-26792. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-04 10:36:04 -08:00
Hyukjin Kwon	0d77d575e1	[MINOR][DOCS] Add a note that 'spark.executor.pyspark.memory' is dependent on 'resource' ## What changes were proposed in this pull request? This PR adds a note that explicitly `spark.executor.pyspark.memory` is dependent on resource module's behaviours at Python memory usage. For instance, I at least see some difference at https://github.com/apache/spark/pull/21977#discussion_r251220966 ## How was this patch tested? Manually built the doc. Closes #23664 from HyukjinKwon/note-resource-dependent. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-31 15:51:40 +08:00
Wenchen Fan	ed71a825c5	[SPARK-26700][CORE] enable fetch-big-block-to-disk by default ## What changes were proposed in this pull request? This is a followup of #16989 The fetch-big-block-to-disk feature is disabled by default, because it's not compatible with external shuffle service prior to Spark 2.2. The client sends stream request to fetch block chunks, and old shuffle service can't support it. After 2 years, Spark 2.2 has EOL, and now it's safe to turn on this feature by default ## How was this patch tested? existing tests Closes #23625 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-28 23:41:55 +08:00
Devaraj K	1b75f3bcff	[SPARK-17928][MESOS] No driver.memoryOverhead setting for mesos cluster mode ## What changes were proposed in this pull request? Added a new configuration 'spark.mesos.driver.memoryOverhead' for providing the driver memory overhead in mesos cluster mode. ## How was this patch tested? Verified it manually, Resource Scheduler allocates (drivermemory+ driver memoryOverhead) for driver in mesos cluster mode. Closes #17726 from devaraj-kavali/SPARK-17928. Authored-by: Devaraj K <devaraj@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-15 15:45:20 -06:00
Sean Owen	2f8a938805	[SPARK-26539][CORE] Remove spark.memory.useLegacyMode and StaticMemoryManager ## What changes were proposed in this pull request? Remove spark.memory.useLegacyMode and StaticMemoryManager. Update tests that used the StaticMemoryManager to equivalent use of UnifiedMemoryManager. ## How was this patch tested? Existing tests, with modifications to make them work with a different mem manager. Closes #23457 from srowen/SPARK-26539. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-10 08:57:44 -06:00
“attilapiros”	e103c4a5e7	[SPARK-24920][CORE] Allow sharing Netty's memory pool allocators ## What changes were proposed in this pull request? Introducing shared pooled ByteBuf allocators. This feature can be enabled via the "spark.network.sharedByteBufAllocators.enabled" configuration. When it is on, only two pooled ByteBuf allocators are created: - one for transport servers where caching is allowed and - one for transport clients where caching is disabled This way the cache allowance remains as before. Both shareable pools are created with the numCores parameter set to 0 (which defaults to the available processors) as conf.serverThreads() and conf.clientThreads() are module dependent and the lazy creation of these allocators would lead to unpredictable behaviour. When "spark.network.sharedByteBufAllocators.enabled" is false, a new allocator is created for every transport client and server separately, as before this PR. ## How was this patch tested? Existing unit tests. Closes #23278 from attilapiros/SPARK-24920. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-08 13:11:11 -06:00
韩田田00222924	82c1ac48a3	[SPARK-25696] The storage memory displayed on spark Application UI is incorrect. ## What changes were proposed in this pull request? In the reported heartbeat information, the unit of the memory data is bytes, which is converted by the formatBytes() function in the utils.js file before being displayed in the interface. The base of the unit conversion in the formatBytes function is 1000, but it should be 1024. Change the base of the unit conversion in the formatBytes function to 1024. ## How was this patch tested? manual tests Closes #22683 from httfighter/SPARK-25696. Lead-authored-by: 韩田田00222924 <han.tiantian@zte.com.cn> Co-authored-by: han.tiantian@zte.com.cn <han.tiantian@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-10 18:27:01 -06:00
hyukjinkwon	9cda9a892d	[SPARK-26080][PYTHON] Skips Python resource limit on Windows in Python worker ## What changes were proposed in this pull request? `resource` package is a Unix specific package. See https://docs.python.org/2/library/resource.html and https://docs.python.org/3/library/resource.html. Note that we document Windows support: > Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). This should be backported into branch-2.4 to restore Windows support in Spark 2.4.1. ## How was this patch tested? Manually mocking the changed logics. Closes #23055 from HyukjinKwon/SPARK-26080. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-02 17:41:08 +08:00
Keiji Yoshida	c3f27b2437	[MINOR][DOCS] Fix typos ## What changes were proposed in this pull request? Fix Typos. This PR is the complete version of https://github.com/apache/spark/pull/23145. ## How was this patch tested? NA Closes #23185 from kjmrknsn/docUpdate. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-30 09:03:46 -06:00
“attilapiros”	ab61ddb34d	[SPARK-26118][WEB UI] Introducing spark.ui.requestHeaderSize for setting HTTP requestHeaderSize ## What changes were proposed in this pull request? Introducing spark.ui.requestHeaderSize for configuring Jetty's HTTP requestHeaderSize. This way a long authorization field does not lead to HTTP 413. ## How was this patch tested? Manually with curl (version 7.55 or later is required). With the original default value (8k limit): ```bash # Starting history server with default requestHeaderSize $ ./sbin/start-history-server.sh starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out # Creating huge header $ echo -n "X-Custom-Header: " > cookie $ printf 'A%.0s' {1..9500} >> cookie # HTTP GET with huge header fails with 431 $ curl -H cookie http://458apiros-MBP.lan:18080/ <h1>Bad Message 431</h1><pre>reason: Request Header Fields Too Large</pre> # The log contains the error $ tail -1 /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out 18/11/19 21:24:28 WARN HttpParser: Header is too large 8193>8192 ``` After: ```bash # Creating the history properties file with the increased requestHeaderSize $ echo spark.ui.requestHeaderSize=10000 > history.properties # Starting Spark History Server with the settings $ ./sbin/start-history-server.sh --properties-file history.properties starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out # HTTP GET with huge header gives back HTML5 (only part of the response is shown here) $ curl -H cookie http://458apiros-MBP.lan:18080/ <!DOCTYPE html><html> <head>... <link rel="shortcut icon" href="/static/spark-logo-77x50px-hd.png"></link> <title>History Server</title> </head> <body> ... ``` Closes #23090 from attilapiros/JettyHeaderSize. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2018-11-20 08:56:22 -06:00
ankurgupta	5f11e8c4cb	[SPARK-25118][CORE] Persist Driver Logs in Client mode to Hdfs Currently, we do not have a mechanism to collect driver logs if a user chooses to run their application in client mode. This is a big issue as admin teams need to create their own mechanisms to capture driver logs. This commit adds a logger which, if enabled, adds a local log appender to the root logger and asynchronously syncs it to an application-specific log file on HDFS (Spark Driver Log Dir). Additionally, this collects spark-shell driver logs at INFO level by default. The change is that, in the case of spark-shell, instead of setting the root logger level to WARN, we set the consoleAppender threshold to WARN. This ensures that only WARN logs are printed on CONSOLE but other log appenders still capture INFO (or the default log level logs). 1. Verified that logs are written to local and remote dir 2. Added a unit test case 3. Verified this for spark-shell, client mode and pyspark. 4. Verified in both non-kerberos and kerberos environment 5. Verified with following unexpected termination conditions: Ctrl + C, Driver OOM, Large Log Files 6. Ran an application in spark-shell and ensured that driver logs were captured at INFO level 7. Started the application at WARN level, programmatically changed the level to INFO and ensured that logs on console were printed at INFO level Closes #22504 from ankuriitg/ankurgupta/SPARK-25118. Authored-by: ankurgupta <ankur.gupta@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-11-14 08:23:34 -08:00
Dhruve Ashar	fdd3bace1d	[SPARK-22148][SPARK-15815][SCHEDULER] Acquire new executors to avoid hang because of blacklisting ## What changes were proposed in this pull request? Every time a task is unschedulable because of the condition where no. of task failures < no. of executors available, we currently abort the taskSet - failing the job. This change tries to acquire new executors so that we can complete the job successfully. We try to acquire a new executor only when we can kill an existing idle executor. We fallback to the older implementation where we abort the job if we cannot find an idle executor. ## How was this patch tested? I performed some manual tests to check and validate the behavior. ```scala val rdd = sc.parallelize(Seq(1 to 10), 3) import org.apache.spark.TaskContext val mapped = rdd.mapPartitionsWithIndex ( (index, iterator) => { if (index == 2) { Thread.sleep(30 * 1000); val attemptNum = TaskContext.get.attemptNumber; if (attemptNum < 3) throw new Exception("Fail for blacklisting")}; iterator.toList.map (x => x + " -> " + index).iterator } ) mapped.collect ``` Closes #22288 from dhruve/bug/SPARK-22148. Lead-authored-by: Dhruve Ashar <dhruveashar@gmail.com> Co-authored-by: Dhruve Ashar <dhruve@users.noreply.github.com> Co-authored-by: Tom Graves <tgraves@apache.org> Signed-off-by: Thomas Graves <tgraves@apache.org>	2018-11-06 08:25:32 -06:00
Alex Hagerman	1a7abf3f45	[SPARK-25933][DOCUMENTATION] Fix pstats.Stats() reference in configuration.md ## What changes were proposed in this pull request? Change ptats.Stats() to pstats.Stats() for `spark.python.profile.dump` in configuration.md. ## How was this patch tested? Doc test Closes #22933 from AlexHagerman/doc_fix. Authored-by: Alex Hagerman <alex@unexpectedeof.net> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-03 12:56:59 -05:00
Imran Rashid	af3b816070	[SPARK-25855][CORE] Don't use erasure coding for event logs by default ## What changes were proposed in this pull request? This turns off hdfs erasure coding by default for event logs, regardless of filesystem defaults. Because this requires apis only available in hadoop 3, this uses reflection. EC isn't a very good choice for event logs, as hflush() is a no-op, and so updates to the file are not visible for a long time. This can still be configured by setting "spark.eventLog.allowErasureCoding=true", which will use filesystem defaults. ## How was this patch tested? deployed a cluster with the changes with HDFS EC on. By default, event logs didn't use EC, but configuration still would allow EC. Also tried writing to the local fs (which doesn't support EC at all) and things worked fine. Closes #22881 from squito/SPARK-25855. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-31 10:52:51 -07:00
Sean Owen	703e6da1ec	[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration ## What changes were proposed in this pull request? Remove Kafka 0.8 integration ## How was this patch tested? Existing tests, build scripts Closes #22703 from srowen/SPARK-25705. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-16 09:10:24 -05:00
shivusondur	1a6815cd9f	[SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception ## What changes were proposed in this pull request? Documentation is updated with proper classname org.apache.spark.io.ZStdCompressionCodec ## How was this patch tested? we used the spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs. Closes #22669 from shivusondur/CompressionIssue. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-08 15:43:08 +08:00
Shahid	71876633f3	[SPARK-25583][DOC] Add history-server related configuration in the documentation. ## What changes were proposed in this pull request? Add history-server related configuration to the documentation. Some history-server related configurations were missing from the documentation, such as 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc. ## How was this patch tested? ![screenshot from 2018-10-01 20-58-26](https://user-images.githubusercontent.com/23054875/46298568-04833a80-c5bd-11e8-95b8-54c9d6582fd2.png) ![screenshot from 2018-10-01 20-59-31](https://user-images.githubusercontent.com/23054875/46298591-11a02980-c5bd-11e8-93d0-892afdfd4f9a.png) ![screenshot from 2018-10-01 20-59-45](https://user-images.githubusercontent.com/23054875/46298601-1533b080-c5bd-11e8-9689-e9b39882a7b5.png) Closes #22601 from shahidki31/historyConf. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-02 08:05:09 -07:00
Michael Mior	ab25c96790	[SPARK-23820][CORE] Enable use of long form of callsite in logs This is a rework of #21433 to address some concerns there. Closes #22398 from michaelmior/long-callsite2. Authored-by: Michael Mior <mmior@uwaterloo.ca> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-13 09:45:25 +08:00
Ilan Filonenko	1cfda44825	[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S ## What changes were proposed in this pull request? Add spark.executor.pyspark.memory limit for K8S ## How was this patch tested? Unit and Integration tests Closes #22298 from ifilonenko/SPARK-25021. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Holden Karau <holden@pigscanfly.ca>	2018-09-08 22:18:06 -07:00
huangtengfei02	339859c4e4	[SPARK-25261][MINOR][DOC] update the description for spark.executor\|driver.memory in configuration.md ## What changes were proposed in this pull request? As described in [SPARK-25261](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-25261), the unit of spark.executor.memory and spark.driver.memory is parsed as bytes in some cases if no unit is specified, while in https://spark.apache.org/docs/latest/configuration.html#application-properties, they are described as MiB, which may lead to some misunderstandings. ## How was this patch tested? N/A Closes #22252 from ivoson/branch-correct-configuration. Lead-authored-by: huangtengfei02 <huangtengfei02@baidu.com> Co-authored-by: Huang Tengfei <tengfei.h@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-31 09:06:38 -05:00
Reza Safi	135ff16a35	[SPARK-25233][STREAMING] Give the user the option of specifying a minimum message per partition per batch when using kafka direct API with backpressure After SPARK-18371, it is guaranteed that there would be at least one message per partition per batch using direct kafka API when new messages exist in the topics. This change will give the user the option of setting the minimum instead of just a hard-coded limit of 1. The related unit test is updated and some internal tests verified that the topic partitions with new messages will be progressed by the specified minimum. Author: Reza Safi <rezasafi@cloudera.com> Closes #22223 from rezasafi/streaminglag.	2018-08-30 13:26:03 -05:00
Ryan Blue	7ad18ee9f2	[SPARK-25004][CORE] Add spark.executor.pyspark.memory limit. ## What changes were proposed in this pull request? This adds `spark.executor.pyspark.memory` to configure Python's address space limit, [`resource.RLIMIT_AS`](https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS). Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory: ``` File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer fe_eval_rec.update(f(src_rec_prep, mat_rec_prep)) File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, [])) File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare permutations = sorted(permutations, reverse=True) MemoryError ``` The new pyspark memory setting is used to increase requested YARN container memory, instead of sharing overhead memory between python and off-heap JVM activity. ## How was this patch tested? Tested memory limits in our YARN cluster and verified that MemoryError is thrown. Author: Ryan Blue <blue@apache.org> Closes #21977 from rdblue/SPARK-25004-add-python-memory-limit.	2018-08-28 12:31:33 -07:00
Joey Krabacher	709f541dd0	[DOCS] Update configuration.md changed $SPARK_HOME/conf/spark-default.conf to $SPARK_HOME/conf/spark-defaults.conf no testing necessary as this was a change to documentation. Closes #22116 from KraFusion/patch-1. Authored-by: Joey Krabacher <jkrabacher@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-08-16 16:47:52 -07:00
Kazuhiro Sera	8ec25cd67e	Fix typos detected by github.com/client9/misspell ## What changes were proposed in this pull request? Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell). This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know. ## How was this patch tested? ### before ``` $ misspell . \| grep -v '.js' R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition" NOTICE-binary:454:16: "containd" is a misspelling of "contained" R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition" R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition" R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence" R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred" R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output" R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment" common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent" common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent" common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred" 
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin" core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden" core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments" dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual" 
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across" dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across" dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments" docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden" docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes" docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN" docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior" examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract" examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" 
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions" python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment" python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a 
misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability" python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter" python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability" python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter" python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns" python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization" python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary" resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints" resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive" 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when" sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp" sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage" sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred" sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence" 
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing" sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with" sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring" ``` ### after ``` $ misspell . 
\| grep -v '.js' common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" 
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" ``` Closes #22070 from seratch/fix-typo. Authored-by: Kazuhiro Sera <seratch@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-11 21:23:36 -05:00
liuxian	c5fe412928	[SPARK-18188][DOC][FOLLOW-UP] Add `spark.broadcast.checksum` to configuration ## What changes were proposed in this pull request? This PR adds `spark.broadcast.checksum` to the configuration. ## How was this patch tested? manually tested Author: liuxian <liu.xian3@zte.com.cn> Closes #21825 from 10110346/checksum_config.	2018-08-01 21:19:24 -05:00
Imran Rashid	15fff79032	[SPARK-24297][CORE] Fetch-to-disk by default for > 2gb Fetch-to-mem is guaranteed to fail if the message is bigger than 2 GB, so we might as well use fetch-to-disk in that case. The message includes some metadata in addition to the block data itself (in particular UploadBlock has a lot of metadata), so we leave a little room. Author: Imran Rashid <irashid@cloudera.com> Closes #21474 from squito/SPARK-24297.	2018-07-25 09:08:42 +08:00
Yuanjian Li	6a0b77a55d	[SPARK-24215][PYSPARK][FOLLOW UP] Implement eager evaluation for DataFrame APIs in PySpark ## What changes were proposed in this pull request? Address comments in #21370 and add more tests. ## How was this patch tested? Enhanced tests in pyspark/sql/test.py and DataFrameSuite Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21553 from xuanyuanking/SPARK-24215-follow.	2018-06-27 10:43:06 -07:00
Sanket Chintapalli	3af1d3e6d9	[SPARK-24416] Fix configuration specification for killBlacklisted executors ## What changes were proposed in this pull request? spark.blacklist.killBlacklistedExecutors is defined as (Experimental) If set to "true", allow Spark to automatically kill, and attempt to re-create, executors when they are blacklisted. Note that, when an entire node is added to the blacklist, all of the executors on that node will be killed. I presume the killing of blacklisted executors only happens after the stage completes successfully and all tasks have completed or on fetch failures (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is confusing because the definition states that Spark will attempt to re-create the executor as soon as it is blacklisted. This is not true: while the stage is in progress and an executor is blacklisted, Spark will not attempt to clean up until the stage finishes. Author: Sanket Chintapalli <schintap@yahoo-inc.com> Closes #21475 from redsanket/SPARK-24416.	2018-06-12 13:55:08 -05:00
Yuanjian Li	dbb4d83829	[SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark ## What changes were proposed in this pull request? Implement `_repr_html_` for PySpark while in notebook and add config named "spark.sql.repl.eagerEval.enabled" to control this. The dev list thread for context: http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html ## How was this patch tested? New ut in DataFrameSuite and manual test in jupyter. Some screenshot below. After: ![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png) Before: ![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png) Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21370 from xuanyuanking/SPARK-24215.	2018-06-05 08:23:08 +07:00
Yuming Wang	ed1a65448f	[SPARK-19112][CORE][FOLLOW-UP] Add missing shortCompressionCodecNames to configuration. ## What changes were proposed in this pull request? Spark provides four codecs: `lz4`, `lzf`, `snappy`, and `zstd`. This PR adds the missing shortCompressionCodecNames to the configuration. ## How was this patch tested? manually tested Author: Yuming Wang <yumwang@ebay.com> Closes #21431 from wangyum/SPARK-19112.	2018-05-26 20:26:00 +08:00
Jake Charland	a4470bc78c	[SPARK-21673] Use the correct sandbox environment variable set by Mesos ## What changes were proposed in this pull request? This change makes Spark use the correct environment variable set by Mesos in the container on startup. Author: Jake Charland <jakec@uber.com> Closes #18894 from jakecharland/MesosSandbox.	2018-05-22 08:06:15 -05:00
Devaraj K	007ae6878f	[SPARK-24003][CORE] Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's ## What changes were proposed in this pull request? Added support to specify the 'spark.executor.extraJavaOptions' value in terms of the `{{APP_ID}}` and/or `{{EXECUTOR_ID}}`, `{{APP_ID}}` will be replaced by Application Id and `{{EXECUTOR_ID}}` will be replaced by Executor Id while starting the executor. ## How was this patch tested? I have verified this by checking the executor process command and gc logs. I verified the same in different deployment modes(Standalone, YARN, Mesos) client and cluster modes. Author: Devaraj K <devaraj@apache.org> Closes #21088 from devaraj-kavali/SPARK-24003.	2018-04-30 13:40:03 -07:00
Julien Cuquemelle	55c4ca88a3	[SPARK-22683][CORE] Add a executorAllocationRatio parameter to throttle the parallelism of the dynamic allocation ## What changes were proposed in this pull request? By default, dynamic allocation requests enough executors to maximize parallelism according to the number of tasks to process. While this minimizes the latency of the job, with small tasks this setting can waste a lot of resources due to executor allocation overhead, as some executors might not even do any work. This setting lets users set a ratio that is used to reduce the number of target executors w.r.t. full parallelism. The number of executors computed with this setting is still bounded by `spark.dynamicAllocation.maxExecutors` and `spark.dynamicAllocation.minExecutors`. ## How was this patch tested? Unit tests and runs on various actual workloads on a YARN cluster Author: Julien Cuquemelle <j.cuquemelle@criteo.com> Closes #19881 from jcuquemelle/AddTaskPerExecutorSlot.	2018-04-24 10:56:55 -05:00
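
The ratio described in the SPARK-22683 entry above can be illustrated with a small sketch. This is not Spark's code: the function name `target_executors`, its parameters, and the choice to round up are assumptions made for illustration. It only shows the idea stated in the commit message, that a ratio scales down the executor count needed for full parallelism and that the result stays within the configured minimum and maximum.

```python
import math


def target_executors(pending_tasks, tasks_per_executor, ratio,
                     min_executors, max_executors):
    """Hypothetical helper, not part of Spark.

    Scales the executor count needed for full parallelism by `ratio`,
    then clamps the result to [min_executors, max_executors].
    """
    # Executors needed so that every pending task gets a slot at once.
    full_parallelism = pending_tasks / tasks_per_executor
    # Assumption: round up so that a non-zero workload never maps to zero.
    wanted = math.ceil(full_parallelism * ratio)
    return max(min_executors, min(max_executors, wanted))


# 1000 tasks, 4 task slots per executor, bounds of 1 and 500 executors.
print(target_executors(1000, 4, 1.0, 1, 500))  # 250
print(target_executors(1000, 4, 0.5, 1, 500))  # 125
# A lower maximum still caps the result.
print(target_executors(1000, 4, 0.5, 1, 100))  # 100
```

With a ratio of 1.0 the sketch requests full parallelism; halving the ratio halves the request, which trades some job latency for fewer idle executors.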


399 commits