ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	05b85eb8cb	[SPARK-27474][CORE] avoid retrying a task failed with CommitDeniedException many times ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-25250 reports a bug in which a task that failed with `CommitDeniedException` gets retried many times. This can happen when a stage has 2 task set managers (TSMs), one zombie and one active. A task from the zombie TSM completes and commits to a central coordinator (assuming it's a file-writing task). Then the corresponding task from the active TSM will fail with `CommitDeniedException`. `CommitDeniedException.countTowardsTaskFailures` is false, so the active TSM will keep retrying this task until the job finishes. This wastes a lot of resources. #21131 first implemented the idea that a previously successful task from a zombie `TaskSetManager` can mark the task of the same partition as completed in the active `TaskSetManager`. Later, #23871 improved the implementation to cover a corner case in which an active `TaskSetManager` hasn't been created yet when a previous task succeeds. However, #23871 has a bug and was reverted in #24359. With hindsight, #23871 is fragile because we need to sync the states between `DAGScheduler` and `TaskScheduler` about which partitions are completed. This PR proposes a new fix: 1. When `DAGScheduler` gets a task success event from an earlier attempt, notify the `TaskSchedulerImpl` about it. 2. When `TaskSchedulerImpl` knows a partition is already completed, ask the active `TaskSetManager` to mark the corresponding task as finished, if the task is not finished yet. This fix covers the corner case, because: 1. If `DAGScheduler` gets the task completion event from the zombie TSM before submitting the new stage attempt, then `DAGScheduler` knows that this partition is completed, and it will exclude this partition when creating the task set for the new stage attempt. See `DAGScheduler.submitMissingTasks`. 2. If `DAGScheduler` gets the task completion event from the zombie TSM after submitting the new stage attempt, then the active TSM is already created. Compared to the previous fix, the message loop becomes longer, so it's likely that the active task set manager has already retried the task multiple times. But this failure window won't be too big, and we want to avoid the worst case, in which the task is retried many times until the job finishes. So this solution is acceptable. ## How was this patch tested? A new test case. Closes #24375 from cloud-fan/fix2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-29 14:20:58 +08:00
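The two-step fix described in the SPARK-27474 entry above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyTaskSetManager`, `ToyTaskScheduler`, `notifyPartitionCompletion`, and `wouldRun` are hypothetical stand-ins invented for illustration, not Spark's scheduler classes or their real signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Spark's scheduler classes.
public class PartitionCompletionSketch {

    /** Stand-in for the task set manager of one stage attempt. */
    static final class ToyTaskSetManager {
        private final boolean zombie;
        private final boolean[] finished;

        ToyTaskSetManager(int numPartitions, boolean zombie) {
            this.zombie = zombie;
            this.finished = new boolean[numPartitions];
        }

        boolean isZombie() { return zombie; }

        /** Mark the task for this partition as finished, if it is not finished yet. */
        void markPartitionCompleted(int partition) {
            if (!finished[partition]) finished[partition] = true;
        }

        /** An active manager only schedules (or retries) unfinished partitions. */
        boolean wouldRun(int partition) {
            return !zombie && !finished[partition];
        }
    }

    /** Stand-in for the task scheduler, which knows every attempt of a stage. */
    static final class ToyTaskScheduler {
        private final Map<Integer, List<ToyTaskSetManager>> attemptsByStage = new HashMap<>();

        void register(int stageId, ToyTaskSetManager tsm) {
            attemptsByStage.computeIfAbsent(stageId, k -> new ArrayList<>()).add(tsm);
        }

        /** Step 2 of the fix: forward the completion to the active attempt. */
        void notifyPartitionCompletion(int stageId, int partition) {
            for (ToyTaskSetManager tsm : attemptsByStage.getOrDefault(stageId, new ArrayList<>())) {
                if (!tsm.isZombie()) tsm.markPartitionCompleted(partition);
            }
        }
    }

    public static void main(String[] args) {
        ToyTaskScheduler scheduler = new ToyTaskScheduler();
        ToyTaskSetManager active = new ToyTaskSetManager(10, false);
        scheduler.register(4, new ToyTaskSetManager(10, true));
        scheduler.register(4, active);
        // Step 1: the DAG scheduler saw partition 7 succeed in the zombie attempt.
        scheduler.notifyPartitionCompletion(4, 7);
        System.out.println(active.wouldRun(7)); // false
        System.out.println(active.wouldRun(8)); // true
    }
}
```

Because the completion is pushed to the active attempt as an event, the scheduler does not need to keep a separate per-stage set of finished partitions in sync, which is the fragility the commit message attributes to #23871.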
gatorsmile	cd4a284030	[SPARK-27460][FOLLOW-UP][TESTS] Fix flaky tests ## What changes were proposed in this pull request? This patch makes several test flakiness fixes. ## How was this patch tested? N/A Closes #24434 from gatorsmile/fixFlakyTest. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-24 17:36:29 +08:00
Sean Owen	596a5ff273	[MINOR][BUILD] Update genjavadoc to 0.13 ## What changes were proposed in this pull request? Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with. ## How was this patch tested? Existing docs build Closes #24443 from srowen/genjavadoc013. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-24 13:44:48 +09:00
uncleGen	ecfdffcb35	[SPARK-27503][DSTREAM] JobGenerator thread exit for some fatal errors but application keeps running ## What changes were proposed in this pull request? In some corner cases, the `JobGenerator` thread (and some other EventLoop threads) may exit on a fatal error, such as OOM, while the Spark Streaming job keeps running with no batch jobs being generated. Currently, we only report non-fatal errors. ``` override def run(): Unit = { try { while (!stopped.get) { val event = eventQueue.take() try { onReceive(event) } catch { case NonFatal(e) => try { onError(e) } catch { case NonFatal(e) => logError("Unexpected error in " + name, e) } } } } catch { case ie: InterruptedException => // exit even if eventQueue is not empty case NonFatal(e) => logError("Unexpected error in " + name, e) } } ``` In this PR, we double-check that the event thread is alive when posting an event. ## How was this patch tested? Existing unit tests. Closes #24400 from uncleGen/SPARK-27503. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-23 07:11:58 -07:00
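The idea in the SPARK-27503 entry above, checking that the event thread is still alive before accepting a new event, might look roughly like the following. This is a sketch, not the patch itself: `CheckedEventLoop`, `post`, and `pending` are hypothetical names, and throwing `IllegalStateException` is an assumption about how the failure could be surfaced.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration; not Spark's EventLoop.
public class CheckedEventLoop {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread eventThread;

    public CheckedEventLoop(Thread eventThread) {
        this.eventThread = eventThread;
    }

    /** Enqueue an event only if the consumer thread is still running. */
    public void post(String event) {
        if (!eventThread.isAlive()) {
            // The consumer died (for example on a fatal error), so fail loudly
            // instead of queueing events that nobody will ever process.
            throw new IllegalStateException(
                "event thread " + eventThread.getName() + " is not running");
        }
        queue.add(event);
    }

    /** Number of events waiting to be consumed. */
    public int pending() {
        return queue.size();
    }
}
```

Without such a check, producers keep posting into a queue that is never drained, which matches the symptom in the commit message: the application keeps running but no batch jobs are generated.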
Shixiong Zhu	009059e3c2	[SPARK-27496][CORE] Fatal errors should also be sent back to the sender ## What changes were proposed in this pull request? When a fatal error (such as StackOverflowError) is thrown from "receiveAndReply", we should try our best to notify the sender. Otherwise, the sender will hang until timeout. In addition, when a MessageLoop dies unexpectedly, it should resubmit a new one so that the Dispatcher keeps working. ## How was this patch tested? New unit tests. Closes #24396 from zsxwing/SPARK-27496. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 17:00:07 -07:00
Shahid	16bbe0f798	[SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite ## What changes were proposed in this pull request? We have disabled a test related to storage in the History server suite after SPARK-13845. But, after SPARK-22050, we can store the information about block updated events to eventLog, if we enable "spark.eventLog.logBlockUpdates.enabled=true". So, we can enable the test, by adding an eventlog corresponding to the application, which has enabled the configuration, "spark.eventLog.logBlockUpdates.enabled=true" ## How was this patch tested? Existing UTs Closes #24390 from shahidki31/enableRddStorageTest. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-19 08:12:20 -07:00
pengbo	54b0d1e0ef	[SPARK-27416][SQL] UnsafeMapData & UnsafeArrayData Kryo serialization … ## What changes were proposed in this pull request? Finish the remaining work of https://github.com/apache/spark/pull/24317, https://github.com/apache/spark/pull/9030 a. Implement Kryo serialization for UnsafeArrayData b. Fix the UnsafeMapData Java/Kryo serialization issue when two machines have different Oops sizes c. Move the duplicated code "getBytes()" to Utils. ## How was this patch tested? Corresponding unit tests have been added and run. Closes #24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 13:03:00 +08:00
shivusondur	88d9de26dd	[SPARK-27464][CORE] Added Constant instead of referring string literal used from many places ## What changes were proposed in this pull request? Added Constant instead of referring the same String literal "spark.buffer.pageSize" from many places ## How was this patch tested? Run the corresponding Unit Test Cases manually. Closes #24368 from shivusondur/Constant. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-16 09:30:46 -05:00
Kazuaki Ishizaki	257d01a6b8	[SPARK-27397][CORE] Take care of OpenJ9 JVM in Spark ## What changes were proposed in this pull request? This PR supports `OpenJ9` in addition to `IBM JDK` and `OpenJDK` in Spark by handling `System.getProperty("java.vendor") = "Eclipse OpenJ9"`. In `inferDefaultMemory()` and `getKrb5LoginModuleName()`, this PR uses non `IBM` way. ``` $ ~/jdk-11.0.2+9_openj9-0.12.1/bin/jshell \| Welcome to JShell -- Version 11.0.2 \| For an introduction type: /help intro jshell> System.out.println(System.getProperty("java.vendor")) Eclipse OpenJ9 jshell> System.out.println(System.getProperty("java.vm.info")) JRE 11 Linux amd64-64-Bit Compressed References 20190204_127 (JIT enabled, AOT enabled) OpenJ9 - 90dd8cb40 OMR - d2f4534b JCL - 289c70b6844 based on jdk-11.0.2+9 jshell> System.out.println(Class.forName("com.ibm.lang.management.OperatingSystemMXBean").getDeclaredMethod("getTotalPhysicalMemory")) public abstract long com.ibm.lang.management.OperatingSystemMXBean.getTotalPhysicalMemory() jshell> System.out.println(Class.forName("com.sun.management.OperatingSystemMXBean").getDeclaredMethod("getTotalPhysicalMemorySize")) public abstract long com.sun.management.OperatingSystemMXBean.getTotalPhysicalMemorySize() jshell> System.out.println(Class.forName("com.ibm.security.auth.module.Krb5LoginModule")) \| Exception java.lang.ClassNotFoundException: com.ibm.security.auth.module.Krb5LoginModule \| at Class.forNameImpl (Native Method) \| at Class.forName (Class.java:339) \| at (#1:1) jshell> System.out.println(Class.forName("com.sun.security.auth.module.Krb5LoginModule")) class com.sun.security.auth.module.Krb5LoginModule ``` ## How was this patch tested? Existing test suites Manual testing with OpenJ9. Closes #24308 from kiszk/SPARK-27397. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-16 09:11:47 -05:00
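The jshell transcript in the SPARK-27397 entry above shows that OpenJ9 reports `java.vendor` as `Eclipse OpenJ9` and ships the `com.sun` Kerberos login module rather than the `com.ibm` one. A vendor check along the following lines would send OpenJ9 down the non-IBM path. This is a sketch under assumptions: `isIbmVendor` and `krb5LoginModuleName` are illustrative helpers, and matching on the substring `IBM` is an assumed way to recognize the IBM JDK, not necessarily what the patch does.

```java
// Illustrative helper; the method names and the substring test are assumptions.
public class VendorSketch {

    /** True only for vendor strings that name IBM, e.g. "IBM Corporation". */
    static boolean isIbmVendor(String javaVendor) {
        return javaVendor != null && javaVendor.contains("IBM");
    }

    /** Pick the Kerberos login module class name for the given JVM vendor. */
    static String krb5LoginModuleName(String javaVendor) {
        return isIbmVendor(javaVendor)
            ? "com.ibm.security.auth.module.Krb5LoginModule"
            : "com.sun.security.auth.module.Krb5LoginModule";
    }

    public static void main(String[] args) {
        String vendor = System.getProperty("java.vendor");
        System.out.println(vendor + " -> " + krb5LoginModuleName(vendor));
    }
}
```

Taking the vendor string as a parameter, instead of reading the system property inside the helper, keeps the decision testable on any JVM.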
Sean Owen	8718367e2e	[SPARK-27470][PYSPARK] Update pyrolite to 4.23 ## What changes were proposed in this pull request? Update pyrolite to 4.23 to pick up bug and security fixes. ## How was this patch tested? Existing tests. Closes #24381 from srowen/SPARK-27470. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-16 19:41:40 +09:00
Wenchen Fan	0bb716bac3	Revert [SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? Our customer has a very complicated job. Sometimes it succeeds and sometimes it fails with ``` Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 4 has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException ``` However, with the patch https://github.com/apache/spark/pull/23871 , the job hangs forever. When I investigated it, I found that `DAGScheduler` and `TaskSchedulerImpl` define stage completion differently. `DAGScheduler` thinks a stage is completed if all its partitions are marked as completed ([result stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1362-L1368) and [shuffle stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1400)). `TaskSchedulerImpl` thinks a stage's task set is completed when all tasks finish (see the [code](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L779-L784)). Ideally these two definitions should be consistent, but #23871 breaks this. In our customer's Spark log, I found that a stage's task set completes, but the stage never completes. More specifically, `DAGScheduler` submits a task set for stage 4.1 with 1000 tasks, but the `TaskSetManager` skips running the first 100 tasks. Later on, `TaskSetManager` finishes 900 tasks and marks the task set as completed. However, `DAGScheduler` doesn't agree with it and hangs forever, waiting for more task completion events of stage 4.1. With hindsight, I think `TaskSchedulerImpl.stageIdToFinishedPartitions` is fragile. We need to put in more effort to make sure this is consistent with `DAGScheduler`'s knowledge. When `DAGScheduler` marks some partitions from finished to unfinished, `TaskSchedulerImpl.stageIdToFinishedPartitions` should be updated as well. This PR reverts #23871; let's think of a more robust idea later. ## How was this patch tested? N/A Closes #24359 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-14 16:57:41 +08:00
Bago Amirbekian	eea3f55a31	[SPARK-27446][R] Use existing spark conf if available. ## What changes were proposed in this pull request? The RBackend and RBackendHandler create new conf objects that don't pick up conf values from the existing SparkSession and therefore always use the default conf values instead of values specified by the user. In this fix we check to see if the spark env already exists, and get the conf from there. We fall back to creating a new conf. This follows the pattern used in other places including this: `3725b1324f/core/src/main/scala/org/apache/spark/api/r/BaseRRunner.scala (L261)` Closes #24353 from MrBago/r-backend-use-existing-conf. Authored-by: Bago Amirbekian <bago@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-14 17:09:12 +09:00
Sean Owen	4ec7f631aa	[SPARK-27404][CORE][SQL][STREAMING][YARN] Fix build warnings for 3.0: postfixOps edition ## What changes were proposed in this pull request? Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds". Which, I'd like to simplify to things like "2.minutes" anyway. ## How was this patch tested? Existing tests. Closes #24314 from srowen/SPARK-27404. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-11 13:43:44 -05:00
Maxim Gekk	d33ae2e9ed	[SPARK-26953][CORE][TEST] Disable result checking in the test: java.lang.ArrayIndexOutOfBoundsException in TimSort ## What changes were proposed in this pull request? I propose to disable (comment out) result checking in `SorterSuite`.`java.lang.ArrayIndexOutOfBoundsException in TimSort` because: 1. The check is optional, and the correctness of TimSort is checked by other tests. The purpose of the test is to check that TimSort doesn't fail with `ArrayIndexOutOfBoundsException`. 2. It significantly reduces the execution time of the test. Here are timings from running the test locally: ``` Sort: 1.4 seconds Result checking: 15.6 seconds ``` ## How was this patch tested? By `SorterSuite`. Closes #24343 from MaxGekk/timsort-test-speedup. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 07:58:57 -07:00
Shixiong Zhu	5ff39cd5ee	[SPARK-27394][WEBUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate ## What changes were proposed in this pull request? This PR updates `AppStatusListener` to flush `LiveEntity` if necessary when receiving `SparkListenerExecutorMetricsUpdate`. This will ensure the staleness of Spark UI doesn't last more than the executor heartbeat interval. ## How was this patch tested? The new unit test. Closes #24303 from zsxwing/SPARK-27394. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-09 08:26:00 -07:00
LantaoJin	52838e74af	[SPARK-13704][CORE][YARN] Reduce rack resolution time ## What changes were proposed in this pull request? When you submit a stage on a large cluster, rack resolving takes a long time when initializing TaskSetManager because a script is invoked to resolve the rack of each host, one by one. With the current implementation, it takes 30~40 seconds to resolve the racks in our 5000-node cluster. After applying the patch, this decreased to less than 15 seconds. YARN-9332 has added an interface to handle multiple hosts in one invocation to save time. But before upgrading to the newest Hadoop, we could construct the same tool in Spark to resolve this issue. ## How was this patch tested? UT and manual testing on a 5000-node cluster. Closes #24245 from squito/SPARK-13704_update. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-08 10:47:06 -05:00
liulijia	39f75b4588	[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores ## What changes were proposed in this pull request? Check spark.task.cpus before creating TaskScheduler in SparkContext ## How was this patch tested? UT Closes #24261 from liutang123/SPARK-27192. Authored-by: liulijia <liutang123@yeah.net> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-05 13:55:57 -05:00
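The constraint named in the SPARK-27192 title above (`spark.task.cpus` must not exceed `spark.executor.cores`) amounts to a fail-fast validation before any scheduler is built. The following is a minimal sketch of such a check; `TaskCpusCheck.validate` and its error message are hypothetical and do not reproduce the code added to `SparkContext`.

```java
// Hypothetical validation helper; not the check added to SparkContext.
public class TaskCpusCheck {

    /** Reject configurations in which a single task needs more cores than an executor has. */
    static void validate(int executorCores, int taskCpus) {
        if (taskCpus > executorCores) {
            throw new IllegalArgumentException(
                "spark.task.cpus (" + taskCpus + ") must be <= spark.executor.cores ("
                    + executorCores + ")");
        }
    }

    public static void main(String[] args) {
        validate(8, 2); // fine: four tasks fit on one executor
        try {
            validate(2, 4); // no executor could ever run such a task
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at startup is preferable to the alternative, in which tasks can never be scheduled and the job appears to hang.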
Dongjoon Hyun	982c4c8e3c	[SPARK-27390][CORE][SQL][TEST] Fix package name mismatch ## What changes were proposed in this pull request? This PR aims to clean up package name mismatches. ## How was this patch tested? Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 11:50:37 -07:00
Sean Owen	23bde44797	[SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes ## What changes were proposed in this pull request? Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12. Add missing mustache license ## How was this patch tested? I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage. Closes #24288 from srowen/SPARK-27358. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-05 12:54:01 -05:00
Wenchen Fan	b56e433b54	[SPARK-27338][CORE][FOLLOWUP] remove trailing space ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes #24289 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 11:43:20 +08:00
Venkata krishnan Sowrirajan	6c4552c650	[SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? `UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs the lock on `TaskMemoryManager` and can cause a spill that requires the lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation, similar to SPARK-26265. To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265. ## How was this patch tested? Manual tests were done, and we will also try to add a test. Closes #24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 09:58:05 +08:00
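The lock-ordering fix described in the SPARK-27338 entry above follows a general pattern: decide what to free while holding the first lock, but perform the free, which needs the second lock, only after the first lock is released. The sketch below is a simplified illustration of that pattern; `sorterLock`, `memoryManagerLock`, and the recording list are hypothetical stand-ins, not the fields of `UnsafeExternalSorter` or `TaskMemoryManager`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of freeing a page outside the synchronized block.
public class FreeOutsideLockSketch {
    private final Object sorterLock = new Object();        // stand-in for the sorter's monitor
    private final Object memoryManagerLock = new Object(); // stand-in for the memory manager's monitor
    private Object lastPage = new Object();

    /** Records, for each free, whether the sorter lock was held at that moment. */
    final List<Boolean> sorterLockHeldDuringFree = new ArrayList<>();

    /** Stand-in for freePage: it needs the memory manager's lock. */
    private void freePage(Object page) {
        sorterLockHeldDuringFree.add(Thread.holdsLock(sorterLock));
        synchronized (memoryManagerLock) {
            // release the page here
        }
    }

    /** Fixed ordering: never hold sorterLock while acquiring memoryManagerLock. */
    void loadNext() {
        Object pageToFree = null;
        synchronized (sorterLock) {
            if (lastPage != null) {
                pageToFree = lastPage; // only remember the page under the lock
                lastPage = null;
            }
        }
        if (pageToFree != null) {
            freePage(pageToFree);      // the free happens after the lock is released
        }
    }
}
```

With this ordering, a thread in `loadNext` never waits for the memory manager's lock while holding the sorter's lock, so the cycle described in the commit message cannot form on this path.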
LantaoJin	69dd44af19	[SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap can't be serialized/deserialized with the unsafe KryoSerializer. This is a bug in RoaringBitmap 0.5.11 that is fixed in the latest version. This is an update of #24157 ## How was this patch tested? Added a UT. Closes #24264 from LantaoJin/SPARK-27216. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-03 20:09:50 -05:00
Gabor Somogyi	57aff93886	[SPARK-26998][CORE] Remove SSL configuration from executors ## What changes were proposed in this pull request? Several SSL passwords showed up as command-line arguments on the executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef \| egrep "94556\|94680\|94793\|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp 
/Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556\|94680\|94793\|94803 ``` Closes #24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-02 09:18:43 -07:00
Sean Owen	d4420b455a	[SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes #24241 from srowen/SPARK-27323. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-02 07:37:05 -07:00
“attilapiros”	9eb896cc3b	[SPARK-27333][TEST] Update thread audit whitelist to skip broadcast-exchange-., process reaper and StatisticsDataReferenceCleaner threads ## What changes were proposed in this pull request? Update thread audit whitelist to skip threads of the global broadcast exchange thread pool, process reaper and Hadoop FS statistics data reference cleaner thread. ## How was this patch tested? Via existing UT using broadcast exchange via `sbt` i.e: ``` > project sql > testOnly .SessionStateSuite -- -z "fork new sessions and run query on inherited table" ``` Before (wrapped long line for manually to save horizontal scrolling for reviewers): ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.SessionStateSuite, thread names: broadcast-exchange-6, broadcast-exchange-0, broadcast-exchange-2, broadcast-exchange-5, broadcast-exchange-7, broadcast-exchange-4, broadcast-exchange-1, process reaper, broadcast-exchange-3, org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner ===== ``` After this change no possible thread leak detected. Closes #24244 from attilapiros/thread-audit-minor. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-31 17:33:31 -07:00
gatorsmile	92b6f86f6d	[SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information ## What changes were proposed in this pull request? This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios. ## How was this patch tested? N/A Closes #24257 from gatorsmile/followupSPARK-27244. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 22:58:28 -07:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
Dongjoon Hyun	88ea319871	Revert "[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores" This reverts commit `f8fa564dec`.	2019-03-30 16:35:34 -07:00
liulijia	f8fa564dec	[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores ## What changes were proposed in this pull request? spark.task.cpus should be less or equal than spark.executor.cores when use static executor allocation ## How was this patch tested? manual Closes #24131 from liutang123/SPARK-27192. Authored-by: liulijia <liutang123@yeah.net> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 12:38:05 -05:00
Ninad Ingole	dbc7ce18b9	[SPARK-27244][CORE] Redact Passwords While Using Option logConf=true ## What changes were proposed in this pull request? When logConf is set to true, config keys that contain passwords were printed in cleartext in the driver log. This change uses the already present redact method in Utils to redact all the passwords based on the redaction pattern in SparkConf, and then prints the conf to the driver log, thus ensuring that sensitive information like passwords is not printed in clear text. ## How was this patch tested? This patch was tested through `SparkConfSuite` and then the entire unit test suite through sbt. Closes #24196 from ninadingole/SPARK-27244. Authored-by: Ninad Ingole <robert.wallis@example.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-29 14:16:53 -05:00
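The redaction step described in the SPARK-27244 entry above can be pictured as a key-based filter applied before the configuration is logged. The sketch below is illustrative only: `RedactSketch.redact`, the replacement text, and the example pattern are assumptions made for this example and are not the `Utils` method the commit refers to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only; not Spark's Utils.redact.
public class RedactSketch {
    static final String REPLACEMENT = "*********(redacted)"; // assumed placeholder text

    /** Return a copy of conf in which values of keys matching keyPattern are masked. */
    static Map<String, String> redact(Pattern keyPattern, Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            boolean sensitive = keyPattern.matcher(e.getKey()).find();
            out.put(e.getKey(), sensitive ? REPLACEMENT : e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.app.name", "demo");
        conf.put("spark.ssl.keyPassword", "hunter2");
        // Assumed example pattern: case-insensitive match on "secret" or "password".
        Pattern pattern = Pattern.compile("(?i)secret|password");
        redact(pattern, conf).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Masking by key rather than by value means the log still shows which settings were supplied, without revealing what they were set to.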
Wenchen Fan	e4a968d829	[MINOR][CORE] Remove import scala.collection.Set in TaskSchedulerImpl ## What changes were proposed in this pull request? I was playing with the scheduler and found this weird thing. In `TaskSchedulerImpl` we import `scala.collection.Set` without any reason. This is bad in practice, as it silently changes the actual class when we simply type `Set`, which by default should point to the immutable set. This change only affects one method: `getExecutorsAliveOnHost`. I checked all the caller side and none of them need a general `Set` type. ## How was this patch tested? N/A Closes #24231 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-28 21:12:18 +09:00
Sean Owen	3a8398df5c	[SPARK-26660][FOLLOWUP] Raise task serialized size warning threshold to 1000 KiB ## What changes were proposed in this pull request? Raise the threshold size for serialized task size at which a warning is generated from 100KiB to 1000KiB. As several people have noted, the original change for this JIRA highlighted that this threshold is low. Test output regularly shows: ``` - sorting on StringType with nullable=false, sortOrder=List('a DESC NULLS LAST) 22:47:53.320 WARN org.apache.spark.scheduler.TaskSetManager: Stage 80 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB. 22:47:53.348 WARN org.apache.spark.scheduler.TaskSetManager: Stage 81 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB. 22:47:53.417 WARN org.apache.spark.scheduler.TaskSetManager: Stage 83 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB. 22:47:53.444 WARN org.apache.spark.scheduler.TaskSetManager: Stage 84 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB. ... 
- SPARK-20688: correctly check analysis for scalar sub-queries 22:49:10.314 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.8 KiB - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1 22:49:10.595 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB 22:49:10.744 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB 22:49:10.894 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 2 - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 3 - SPARK-23316: AnalysisException after max iteration reached for IN query 22:49:11.559 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 154.2 KiB ``` It seems that a larger threshold of about 1MB is more suitable. ## How was this patch tested? Existing tests. Closes #24226 from srowen/SPARK-26660.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-27 10:42:26 +09:00
Stavros Kontopoulos	05168e725d	[SPARK-24793][K8S] Enhance spark-submit for app management - supports `--kill` & `--status` flags. - supports globs, which are useful in general; see this long-standing [issue](https://github.com/kubernetes/kubernetes/issues/17144#issuecomment-272052461) for kubectl. Tested manually against running apps. Example output: Submission Id reported at launch time: ``` 2019-01-20 23:47:56 INFO Client:58 - Waiting for application spark-pi with submissionId spark:spark-pi-1548020873671-driver to finish... ``` Killing the app: ``` ./bin/spark-submit --kill spark:spark-pi-1548020873671-driver --master k8s://https://192.168.2.8:8443 2019-01-20 23:48:07 WARN Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0) 2019-01-20 23:48:07 WARN Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address ``` App terminates with 143 (SIGTERM; since we have tini this should lead to [graceful shutdown](https://cloud.google.com/solutions/best-practices-for-building-containers)): ``` 2019-01-20 23:48:08 INFO LoggingPodStatusWatcherImpl:58 - State changed, new state: pod name: spark-pi-1548020873671-driver namespace: spark labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65 creation time: 2019-01-20T21:47:55Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T21:47:55Z phase: Running container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: running container started at: 2019-01-20T21:48:00Z 2019-01-20 23:48:09 INFO LoggingPodStatusWatcherImpl:58 - State changed, new state: pod name: spark-pi-1548020873671-driver namespace: spark labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65 
creation time: 2019-01-20T21:47:55Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T21:47:55Z phase: Failed container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T21:48:00Z container finished at: 2019-01-20T21:48:08Z exit code: 143 termination reason: Error 2019-01-20 23:48:09 INFO LoggingPodStatusWatcherImpl:58 - Container final statuses: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T21:48:00Z container finished at: 2019-01-20T21:48:08Z exit code: 143 termination reason: Error 2019-01-20 23:48:09 INFO Client:58 - Application spark-pi with submissionId spark:spark-pi-1548020873671-driver finished. 2019-01-20 23:48:09 INFO ShutdownHookManager:58 - Shutdown hook called 2019-01-20 23:48:09 INFO ShutdownHookManager:58 - Deleting directory /tmp/spark-f114b2e0-5605-4083-9203-a4b1c1f6059e ``` Glob scenario: ``` ./bin/spark-submit --status spark:spark-pi* --master k8s://https://192.168.2.8:8443 2019-01-20 22:27:44 WARN Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0) 2019-01-20 22:27:44 WARN Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address Application status (driver): pod name: spark-pi-1547948600328-driver namespace: spark labels: spark-app-selector -> spark-f13f01702f0b4503975ce98252d59b94, spark-role -> driver pod uid: c576e1c6-1c54-11e9-8215-a434d9270a65 creation time: 2019-01-20T01:43:22Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T01:43:22Z phase: Running container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: running container 
started at: 2019-01-20T01:43:27Z Application status (driver): pod name: spark-pi-1547948792539-driver namespace: spark labels: spark-app-selector -> spark-006d252db9b24f25b5069df357c30264, spark-role -> driver pod uid: 38375b4b-1c55-11e9-8215-a434d9270a65 creation time: 2019-01-20T01:46:35Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T01:46:35Z phase: Succeeded container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T01:46:39Z container finished at: 2019-01-20T01:46:56Z exit code: 0 termination reason: Completed ``` Closes #23599 from skonto/submit_ops_extension. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-26 11:55:03 -07:00
Ajith	b61dce23d2	[SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html ``Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable. `` i.e we can have finer class loading locks by registering classloaders as parallel capable. (Refer to deadlock due to macro lock https://issues.apache.org/jira/browse/SPARK-26961). All the classloaders we have are wrapper of URLClassLoader which by itself is parallel capable. But this cannot be achieved by scala code due to static registration Refer https://github.com/scala/bug/issues/11429 ## How was this patch tested? All Existing UT must pass Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 19:07:30 -05:00
liuxian	e4b36df2c0	[SPARK-27256][CORE][SQL] If the configuration is used to set the number of bytes, we'd better use `bytesConf`'. ## What changes were proposed in this pull request? Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users. And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we will encounter this exception: ``` Exception in thread "main" java.lang.IllegalArgumentException: spark.sql.files.maxPartitionBytes should be long, but was 256M at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala) ``` This PR uses `bytesConf` to replace `longConf` or `intConf` wherever the configuration is used to set a number of bytes. Configuration change list: `spark.files.maxPartitionBytes` `spark.files.openCostInBytes` `spark.shuffle.sort.initialBufferSize` `spark.shuffle.spill.initialMemoryThreshold` `spark.sql.autoBroadcastJoinThreshold` `spark.sql.files.maxPartitionBytes` `spark.sql.files.openCostInBytes` `spark.sql.defaultSizeInBytes` ## How was this patch tested? 1. Existing unit tests 2. Manual testing Closes #24187 from 10110346/bytesConf. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-25 14:47:40 -07:00
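The difference between a plain long value and a byte-size value in the entry above can be illustrated with a small parser. This is only a sketch under stated assumptions: `ByteSizeSketch` and `parseBytes` are hypothetical names, the suffix table is a simplification, and this is not Spark's own `bytesConf` parsing code.

```scala
object ByteSizeSketch {
  // Binary multipliers keyed by lower-case suffix.
  // Hypothetical helper for illustration; not Spark's parser.
  private val units: Map[String, Long] = Map(
    "" -> 1L, "b" -> 1L,
    "k" -> (1L << 10), "kb" -> (1L << 10),
    "m" -> (1L << 20), "mb" -> (1L << 20),
    "g" -> (1L << 30), "gb" -> (1L << 30)
  )

  // Digits followed by an optional unit suffix; the extractor must match the whole string.
  private val Pattern = "([0-9]+)([a-z]*)".r

  /** Parses strings such as "268435456" or "256M" into a number of bytes. */
  def parseBytes(s: String): Long = s.trim.toLowerCase match {
    case Pattern(num, suffix) if units.contains(suffix) => num.toLong * units(suffix)
    case other => throw new IllegalArgumentException(s"Invalid byte size: $other")
  }
}
```

With a parser of this shape, `256M` and `268435456` resolve to the same value, which is the usability gain the commit describes.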
Luca Canali	4b2b3da766	[SPARK-26928][CORE][FOLLOWUP] Fix JVMCPUSource file name and minor updates to doc ## What changes were proposed in this pull request? This applies some minor updates/cleaning following up SPARK-26928, notably renaming JVMCPU.scala to JVMCPUSource.scala. ## How was this patch tested? Manually tested Closes #24201 from LucaCanali/fixupSPARK-26928. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 15:35:24 -05:00
Sean Owen	8bc304f97e	[SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0 ## What changes were proposed in this pull request? Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below. ## How was this patch tested? Existing tests. Closes #23098 from srowen/SPARK-26132. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 10:46:42 -05:00
Maxim Gekk	027ed2d11b	[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter ## What changes were proposed in this pull request? The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs. ## How was this patch tested? By running the existing tests - XORShiftRandomSuite Closes #20793 from MaxGekk/hash-buff-size. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-23 11:26:09 -05:00
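The buffer-size observation in the entry above can be checked in isolation. The following is a minimal sketch, assuming the seed is written with `ByteBuffer.putLong`; `SeedBufferSketch` and its methods are hypothetical names and not the Spark `hashSeed` implementation.

```scala
import java.nio.ByteBuffer

object SeedBufferSketch {
  // Oversized buffer: 64 bytes, of which only the first 8 are written by putLong.
  def wideBuffer(seed: Long): Array[Byte] =
    ByteBuffer.allocate(64).putLong(seed).array()

  // Right-sized buffer: exactly the 8 bytes of a Long.
  def tightBuffer(seed: Long): Array[Byte] =
    ByteBuffer.allocate(8).putLong(seed).array()

  // True when the wide buffer carries no information beyond the tight one:
  // same leading 8 bytes, and nothing but zeros afterwards.
  def extraBytesAreZero(seed: Long): Boolean = {
    val wide = wideBuffer(seed)
    wide.take(8).sameElements(tightBuffer(seed)) && wide.drop(8).forall(_ == 0)
  }
}
```

Because the trailing 56 bytes are zero for every seed, they cannot distinguish one input from another. Note that hashing the shorter array generally yields a different hash value than hashing the padded one.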
hehuiyuan	68abf77b1a	[SPARK-27184][CORE] Avoid hardcoded 'spark.jars', 'spark.files', 'spark.submit.pyFiles' and 'spark.submit.deployMode' ## What changes were proposed in this pull request? For [SPARK-27184](https://issues.apache.org/jira/browse/SPARK-27184) In the `org.apache.spark.internal.config`, we define the variables of `FILES` and `JARS`, we can use them instead of "spark.jars" and "spark.files". ```scala private[spark] val JARS = ConfigBuilder("spark.jars") .stringConf .toSequence .createWithDefault(Nil) ``` ```scala private[spark] val FILES = ConfigBuilder("spark.files") .stringConf .toSequence .createWithDefault(Nil) ``` Other : In the `org.apache.spark.internal.config`, we define the variables of `SUBMIT_PYTHON_FILES ` and `SUBMIT_DEPLOY_MODE `, we can use them instead of "spark.submit.pyFiles" and "spark.submit.deployMode". ```scala private[spark] val SUBMIT_PYTHON_FILES = ConfigBuilder("spark.submit.pyFiles") .stringConf .toSequence .createWithDefault(Nil) ``` ```scala private[spark] val SUBMIT_DEPLOY_MODE = ConfigBuilder("spark.submit.deployMode") .stringConf .createWithDefault("client") ``` Closes #24123 from hehuiyuan/hehuiyuan-patch-6. Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-23 09:43:00 +09:00
Jungtaek Lim (HeartSaVioR)	8a9eb05137	[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client ## What changes were proposed in this pull request? This patch fixes the issue that ClientEndpoint in standalone cluster mode does not recognize driver options which are passed via SparkConf instead of system properties. When `Client` is executed via the CLI they should be provided as system properties, but with `spark-submit` they can be provided via SparkConf. (SparkSubmit will call `ClientApp.start` with a SparkConf which would contain these options.) ## How was this patch tested? Manually tested via the following steps: 1) set up a standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of the example apps in standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` is provided in system properties in Spark UI <img width="877" alt="Screen Shot 2019-03-21 at 8 18 04 AM" src="https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png"> Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-22 15:07:49 -07:00
Jungtaek Lim (HeartSaVioR)	174531c183	[MINOR][CORE] Leverage modified Utils.classForName to reduce scalastyle off for Class.forName ## What changes were proposed in this pull request? This patch modifies Utils.classForName to have optional parameters - initialize, noSparkClassLoader - to let callers of Class.forName with thread context classloader to use it instead. This helps to reduce scalastyle off for Class.forName. ## How was this patch tested? Existing UTs. Closes #24148 from HeartSaVioR/MINOR-reduce-scalastyle-off-for-class-forname. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-22 05:28:46 -05:00
maryannxue	9f58d3b436	[SPARK-27236][TEST] Refactor log-appender pattern in tests ## What changes were proposed in this pull request? Refactored code in tests regarding the "withLogAppender()" pattern by creating a general helper method in SparkFunSuite. ## How was this patch tested? Passed existing tests. Closes #24172 from maryannxue/log-appender. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-21 19:18:30 -07:00
Jungtaek Lim (HeartSaVioR)	a8d9531edc	[SPARK-27205][CORE] Remove complicated logic for just leaving warning log when main class is scala.App ## What changes were proposed in this pull request? [SPARK-26977](https://issues.apache.org/jira/browse/SPARK-26977) introduced very strange bug which spark-shell is no longer able to load classes which are provided via `--packages`. TBH I don't know about the details why it is broken, but looks like initializing `object class` brings the weirdness (maybe due to static initialization done twice?). This patch removes the logic to leave warning log when main class is scala.App, to not deal with such complexity for just leaving warning message. ## How was this patch tested? Manual test: suppose we run spark-shell with `--packages` option like below: ``` ./bin/spark-shell --verbose --master "local[]" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 ``` Before this patch, importing class in transitive dependency fails: ``` Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[], app id = local-1553005771597). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.kafka <console>:23: error: object kafka is not a member of package org.apache import org.apache.kafka ``` After this patch, importing class in transitive dependency succeeds: ``` Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". 
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1553004095542). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.kafka import org.apache.kafka ``` Closes #24147 from HeartSaVioR/SPARK-27205. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:55:48 -05:00
Lantao Jin	93c6d2a198	[SPARK-27215][CORE] Correct the kryo configurations ## What changes were proposed in this pull request? ```scala val KRYO_USE_UNSAFE = ConfigBuilder("spark.kyro.unsafe") .booleanConf .createWithDefault(false) val KRYO_USE_POOL = ConfigBuilder("spark.kyro.pool") .booleanConf .createWithDefault(true) ``` kyro should be kryo ## How was this patch tested? no need Closes #24156 from LantaoJin/SPARK-27215. Authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 14:27:05 -07:00
Gengliang Wang	ef2d63bfb1	[SPARK-27201][WEBUI] Toggle full job description on click ## What changes were proposed in this pull request? Previously, in https://github.com/apache/spark/pull/6646 there was an improvement to show full job description after double clicks. I think this is a bit hard to be noticed by some users. I suggest changing the event to one click. Also, after the full description is shown, another click should be able to hide the overflow text again. Before click: ![short](https://user-images.githubusercontent.com/1097932/54608784-79bfca80-4a8c-11e9-912b-30799be0d6cb.png) After click: ![full](https://user-images.githubusercontent.com/1097932/54608790-7b898e00-4a8c-11e9-9251-86061158db68.png) Click again: ![short](https://user-images.githubusercontent.com/1097932/54608784-79bfca80-4a8c-11e9-912b-30799be0d6cb.png) ## How was this patch tested? Manually check. Closes #24145 from gengliangwang/showDescriptionDetail. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 21:29:13 +09:00
Ajith	1f692e522c	[SPARK-27200][WEBUI][HISTORYSERVER] History Environment tab must sort Configurations/Properties by default The Environment page in the Spark UI has all configurations sorted by key, but this is not the case in the History Server. To keep the UX the same, we can sort them in the History Server too. ## What changes were proposed in this pull request? When the Environment page is rendered, the properties are sorted before the page is created. ## How was this patch tested? Manually tested in UI Closes #24143 from ajithme/historyenv. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 20:16:17 +09:00
weixiuli	8b0aa59218	[SPARK-26288][CORE] add initRegisteredExecutorsDB ## What changes were proposed in this pull request? Spark on YARN uses a DB (https://github.com/apache/spark/pull/7943) to record RegisteredExecutors information, which can be reloaded and used again when the ExternalShuffleService is restarted. The RegisteredExecutors information cannot be recorded in either Spark standalone mode or Spark on K8s, so it is lost when the ExternalShuffleService is restarted. This commit adds a method to solve that problem. ## How was this patch tested? New unit tests Closes #23393 from weixiuli/SPARK-26288. Authored-by: weixiuli <weixiuli@jd.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-19 16:16:43 -05:00
pgandhi	7043aee1ba	[SPARK-27112][CORE] : Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting ## What changes were proposed in this pull request? There are two deadlocks as a result of the interplay between three different threads: the task-result-getter thread, the spark-dynamic-executor-allocation thread, and the dispatcher-event-loop thread (`makeOffers()`). The fix enforces a lock-ordering constraint by acquiring the lock on `TaskSchedulerImpl` before acquiring the lock on `CoarseGrainedSchedulerBackend` in both `makeOffers()` and `killExecutors()`. This establishes a consistent resource ordering between the threads and thus fixes the deadlocks. ## How was this patch tested? Manual Tests Closes #24072 from pgandhi999/SPARK-27112-2. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-18 10:33:51 -05:00
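The lock-ordering idea in the entry above can be sketched outside Spark. In the following illustration the two plain monitor objects merely stand in for `TaskSchedulerImpl` and `CoarseGrainedSchedulerBackend`; `LockOrderingSketch`, `OrderedLocks`, and the counters are hypothetical and exist only to make the sketch observable.

```scala
object LockOrderingSketch {
  // Stand-ins for the two monitors named in the commit message; not Spark classes.
  final class OrderedLocks {
    private val schedulerLock = new Object // plays the role of TaskSchedulerImpl
    private val backendLock = new Object   // plays the role of CoarseGrainedSchedulerBackend
    private var offers = 0
    private var kills = 0

    // Both code paths take schedulerLock first and backendLock second,
    // so neither thread can hold one lock while waiting for the other in reverse order.
    def makeOffers(): Unit = schedulerLock.synchronized {
      backendLock.synchronized { offers += 1 }
    }

    def killExecutors(): Unit = schedulerLock.synchronized {
      backendLock.synchronized { kills += 1 }
    }

    def counts: (Int, Int) = schedulerLock.synchronized {
      backendLock.synchronized { (offers, kills) }
    }
  }

  /** Runs both paths concurrently and returns (offers, kills) once both threads finish. */
  def run(iterations: Int): (Int, Int) = {
    val locks = new OrderedLocks
    val offerThread = new Thread(new Runnable {
      def run(): Unit = (1 to iterations).foreach(_ => locks.makeOffers())
    })
    val killThread = new Thread(new Runnable {
      def run(): Unit = (1 to iterations).foreach(_ => locks.killExecutors())
    })
    offerThread.start(); killThread.start()
    offerThread.join(); killThread.join()
    locks.counts
  }
}
```

If one of the two methods acquired the locks in the opposite order, the two threads could each hold one lock and wait forever for the other, which is the general shape of deadlock the commit addresses.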
Ajith	fc88d3df5c	[SPARK-27164][CORE] RDD.countApprox on empty RDDs schedules jobs which never complete ## What changes were proposed in this pull request? When the result stage has zero tasks, the Job End event is never fired, so the job is shown as running in the UI forever. Example: sc.emptyRDD[Int].countApprox(1000) never finishes even though it has no tasks to launch ## How was this patch tested? Added UT Closes #24100 from ajithme/emptyRDD. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 12:56:41 -05:00
fitermay	1bc481b779	[SPARK-27070] Improve performance of DefaultPartitionCoalescer This time tested against Scala 2.11 as well Closes #24116 from fitermay/master. Authored-by: fitermay <fiterman@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 11:47:14 -05:00
Ajith	c324e1da9d	[SPARK-27122][CORE] Jetty classes must not be return via getters in org.apache.spark.ui.WebUI ## What changes were proposed in this pull request? When we run YarnSchedulerBackendSuite, the class path seems to be built from the classes folder (resource-managers/yarn/target/scala-2.12/classes) instead of the jar (resource-managers/yarn/target/spark-yarn_2.12-3.0.0-SNAPSHOT.jar). ui.getHandlers is in spark-core and is loaded from spark-core.jar, which is shaded and hence refers to org.spark_project.jetty.servlet.ServletContextHandler. org.apache.spark.scheduler.cluster.YarnSchedulerBackend, on the other hand, is not shaded, so it expects org.eclipse.jetty.servlet.ServletContextHandler. Refer to the discussion at https://issues.apache.org/jira/browse/SPARK-27122?focusedCommentId=16792318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16792318 Hence, as a fix, org.apache.spark.ui.WebUI must only return wrapper class instances or references, so that Jetty classes are avoided in getters which are accessed outside spark-core ## How was this patch tested? Existing UT can pass Closes #24088 from ajithme/shadebug. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:44:02 -05:00
lichaoqun	4132c989db	[MINOR][CORE] spark.diskStore.subDirectories <= 0 should throw Exception ## What changes were proposed in this pull request? This PR adds a check that spark.diskStore.subDirectories > 0. This value needs to be checked before it can be used. ## How was this patch tested? N/A Closes #24024 from lcqzte10192193/wid-lcq-190308. Authored-by: lichaoqun <li.chaoqun@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:43:14 -05:00
Liupengcheng	cad475dcc9	[SPARK-26941][YARN] Fix incorrect computation of maxNumExecutorFailures in ApplicationMaster for streaming ## What changes were proposed in this pull request? Currently, when streaming dynamic allocation is enabled for a streaming application, the maxNumExecutorFailures in ApplicationMaster is still computed with `spark.dynamicAllocation.maxExecutors`. We should use `spark.streaming.dynamicAllocation.maxExecutors` instead. Related code: `f87153a3ac/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (L101)` ## How was this patch tested? NA Closes #23845 from liupc/Fix-incorrect-maxNumExecutorFailures-for-streaming. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 19:45:05 -05:00
SongYadong	ec11790580	[CORE][MINOR] Correct the comment to show heartbeat interval is configurable ## What changes were proposed in this pull request? Executor heartbeat interval is configurable by `"spark.executor.heartbeatInterval"`. But in a comment, heartbeat interval is presented as a constant `10s`. This pr tries to correct the description. ## How was this patch tested? Existing unit tests. Closes #24101 from SongYadong/heartbeat_interval_comment. Authored-by: SongYadong <song.yadong1@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-15 20:30:36 -05:00
Dongjoon Hyun	4bab69b22a	Revert "[SPARK-27070] Fix performance bug in DefaultPartitionCoalescer" This reverts commit `21db4336b0`.	2019-03-15 14:56:08 -07:00
fitermay	21db4336b0	[SPARK-27070] Fix performance bug in DefaultPartitionCoalescer When trying to coalesce a UnionRDD of two large FileScanRDDs (each with a few million partitions) into around 8k partitions the driver can stall for over an hour. Profiler shows that over 90% of the time is spent in TimSort which is invoked by `pickBin`. This patch replaces sorting with a more efficient `min` for the purpose of finding the least occupied PartitionGroup Closes #23986 from fitermay/SPARK-27070. Authored-by: fitermay <fiterman@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 20:13:18 -05:00
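The change described in the entry above, replacing a full sort with a single minimum scan, can be sketched as follows. `Group`, `leastLoadedBySort`, and `leastLoadedByMin` are hypothetical names used for illustration; this is not the actual `pickBin` code.

```scala
object PickBinSketch {
  // Hypothetical stand-in for a partition group and its current occupancy.
  final case class Group(id: Int, numPartitions: Int)

  // Sorting the whole sequence just to read its first element costs O(n log n) per call.
  def leastLoadedBySort(groups: Seq[Group]): Group =
    groups.sortBy(_.numPartitions).head

  // A single linear scan finds the same least occupied group in O(n).
  def leastLoadedByMin(groups: Seq[Group]): Group =
    groups.minBy(_.numPartitions)
}
```

When such a lookup runs once per partition over millions of partitions, the difference between sorting and scanning is repeated on every call, which is consistent with the profile reported in the commit message.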
Ajith	2a04de52dd	[SPARK-26152] Synchronize Worker Cleanup with Worker Shutdown ## What changes were proposed in this pull request? There is a race between the org.apache.spark.deploy.DeployMessages.WorkDirCleanup event and org.apache.spark.deploy.worker.Worker#onStop. It is possible that org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor is shut down while the WorkDirCleanup event is being processed. Any task submitted to the ThreadPoolExecutor after that point results in java.util.concurrent.RejectedExecutionException ## How was this patch tested? Manually Closes #24056 from ajithme/workercleanup. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 09:16:29 -05:00
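The failure mode in the entry above can be reproduced with a plain `java.util.concurrent` executor. The sketch below assumes nothing about the Worker code itself; `ShutdownRaceSketch` and `trySubmit` are hypothetical names, and catching the exception is just one possible way to tolerate the race, not necessarily what the patch does.

```scala
import java.util.concurrent.{ExecutorService, RejectedExecutionException}

object ShutdownRaceSketch {
  /**
   * Submits a task and reports whether the executor accepted it.
   * An executor that has already been shut down rejects new work with
   * RejectedExecutionException, which is the symptom named in the commit message.
   */
  def trySubmit(pool: ExecutorService, task: Runnable): Boolean =
    try {
      pool.execute(task)
      true
    } catch {
      case _: RejectedExecutionException => false
    }
}
```

A caller that may race with shutdown can use the returned flag to skip the cleanup instead of letting the exception propagate.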
Jungtaek Lim (HeartSaVioR)	f57af2286f	[MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... 
[SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- \| \| modules \|\| artifacts \| \| conf \| number\| search\|dwnlded\|evicted\|\| number\|dwnlded\| --------------------------------------------------------------------- \| default \| 3 \| 3 \| 3 \| 0 \|\| 3 \| 3 \| --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 18:01:16 -05:00
Liupengcheng	d5cfe08fdc	[SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. ## What changes were proposed in this pull request? There is a race condition in the `ExecutorAllocationManager`: the `SparkListenerExecutorRemoved` event can be posted before the `SparkListenerTaskStart` event, which leaves `executorIds` in an incorrect state. Then, when some executor becomes idle, real executors are removed even though the actual executor count equals `minNumExecutors`, because `newExecutorTotal` is computed incorrectly (it may be greater than `minNumExecutors`). This can eventually leave zero available executors while a wrong positive number of `executorIds` is kept in memory. What's more, even the `SparkListenerTaskEnd` event cannot release the fake `executorIds`, because a later idle event for the fake executors cannot cause their real removal: they are already removed and do not exist in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927 This PR fixes this problem. ## How was this patch tested? Existing UTs and an added UT Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-12 13:53:42 -07:00
ankurgupta	688b0c01fa	[SPARK-26089][CORE] Handle corruption in large shuffle blocks ## What changes were proposed in this pull request? SPARK-4105 added corruption detection in shuffle blocks, but that was limited to blocks smaller than maxBytesInFlight/3. This commit builds on that by adding a corruption check for large blocks. Two changes/improvements are made in this commit: 1. Large blocks are checked up to maxBytesInFlight/3 bytes in the same way as smaller blocks, so if a large block is corrupt near its start, the block is re-fetched, and if that also fails, a FetchFailureException is thrown. 2. If a large block is corrupt beyond the first maxBytesInFlight/3 bytes, any IOException thrown while reading the stream is converted to a FetchFailureException. This is slightly more aggressive than was originally intended, but since the consumer of the stream may have already read and processed some records, we can't just re-fetch the block; we need to fail the whole task. Additionally, we considered adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction. Thanks to squito for direction and support. ## How was this patch tested? Changed the junit test for big blocks to check for corruption. Closes #23453 from ankuriitg/ankurgupta/SPARK-26089. Authored-by: ankurgupta <ankur.gupta@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-12 14:27:44 -05:00
shivusondur	4b6d39d85d	[SPARK-27090][CORE] Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>") ## What changes were proposed in this pull request? LEGACY_DRIVER_IDENTIFIER and its references are removed. The corresponding tests are updated. ## How was this patch tested? Ran the UT test cases Closes #24026 from shivusondur/newjira2. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 13:29:39 -05:00
hongdongdong	1029bf9c35	Use variable instead of function to keep the format uniform ## What changes were proposed in this pull request? The change just uses the variable (_taskScheduler) instead of the function (taskScheduler) to keep the format uniform across different situations. ## How was this patch tested? Closes #24048 from hddong/Use-variable-instead-of-function. Authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-11 19:00:26 -05:00
Hyukjin Kwon	3725b1324f	[SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner ## What changes were proposed in this pull request? This PR proposes to have one base R runner. In the high level, Previously, it had `ArrowRRunner` and it inherited `RRunner`: ``` └── RRunner └── ArrowRRunner ``` After this PR, now it has a `BaseRRunner`, and `ArrowRRunner` and `RRunner` inherit `BaseRRunner`: ``` └── BaseRRunner ├── ArrowRRunner └── RRunner ``` This way is consistent with Python's. In more details, see below: ```scala class BaseRRunner[IN, OUT] { def compute: Iterator[OUT] = { ... newWriterThread(...).start() ... newReaderIterator(...) ... } // Make a thread that writes data from JVM to R process abstract protected def newWriterThread(..., iter: Iterator[IN], ...): WriterThread // Make an iterator that reads data from the R process to JVM abstract protected def newReaderIterator(...): ReaderIterator abstract class WriterThread(..., iter: Iterator[IN], ...) extends Thread { override def run(): Unit { ... writeIteratorToStream(...) ... } // Actually writing logic to the socket stream. abstract protected def writeIteratorToStream(dataOut: DataOutputStream): Unit } abstract class ReaderIterator extends Iterator[OUT] { override def hasNext(): Boolean = { ... read(...) ... } override def next(): OUT = { ... hasNext() ... } // Actually reading logic from the socket stream. abstract protected def read(...): OUT } } ``` ```scala case [Arrow]RRunner extends BaseRRunner { override def newWriterThread(...) { new WriterThread(...) { override def writeIteratorToStream(...) { ... } } } override def newReaderIterator(...) { new ReaderIterator(...) { override def read(...) { ... } } } } ``` ## How was this patch tested? Manually tested and existing tests should cover. Closes #23977 from HyukjinKwon/SPARK-26923. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-12 08:45:29 +09:00
Marcelo Vanzin	f1e223bfa3	[SPARK-27004][CORE] Remove stale HTTP auth code. This code is from the era when Spark used an HTTP server to distribute dependencies, which is long gone. Nowadays it only causes problems when someone is using dependencies from an HTTP server with Spark auth on. Closes #24033 from vanzin/SPARK-27004. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-11 12:27:25 -07:00
Ajith	b98922abf2	[SPARK-27116] Environment tab must sort Hadoop Configuration by default ## What changes were proposed in this pull request? The Environment tab in SparkUI does not have the Hadoop Configuration sorted. All other tables on the same page, like Spark Configurations, System Configuration etc., are sorted by key by default. ## How was this patch tested? Manually tested on SparkUI Closes #24038 from ajithme/sqluisort. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-11 08:43:49 -05:00
Gabor Somogyi	29d9021245	[SPARK-24621][WEBUI] Show secure URLs on web pages ## What changes were proposed in this pull request? Web UI URLs are pointing to `http://` targets even if SSL is enabled. In this PR I've changed the code to point to `https://` URLs. ## How was this patch tested? Existing unit tests + manually by starting standalone master/worker/spark-shell. Please see jira. Closes #23991 from gaborgsomogyi/SPARK-24621. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-10 19:28:35 -05:00
Hyukjin Kwon	28d003097b	[SPARK-27102][R][PYTHON][CORE] Remove the references to Python's Scala codes in R's Scala codes ## What changes were proposed in this pull request? Currently, R's Scala codes happened to refer Python's Scala codes for code deduplications. It's a bit odd. For instance, when we face an exception from R, it shows python related code path, which makes confusing to debug. It should rather have one code base and R's and Python's should share. This PR proposes: 1. Make a `SocketAuthServer` and move `PythonServer` so that `PythonRDD` and `RRDD` can share it. 2. Move `readRDDFromFile` and `readRDDFromInputStream` into `JavaRDD`. 3. Reuse `RAuthHelper` and remove `RSocketAuthHelper` in `RRDD`. 4. Rename `getEncryptionEnabled` to `isEncryptionEnabled` while I am here. So, now, the places below: - `sql/core/src/main/scala/org/apache/spark/sql/api/r` - `core/src/main/scala/org/apache/spark/api/r` - `mllib/src/main/scala/org/apache/spark/ml/r` don't refer Python's Scala codes. ## How was this patch tested? Existing tests should cover this. Closes #24023 from HyukjinKwon/SPARK-27102. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-10 15:08:23 +09:00
Kris Mok	57ae251f75	[SPARK-27097] Avoid embedding platform-dependent offsets literally in whole-stage generated code ## What changes were proposed in this pull request? Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it: - Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors. - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only. - Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program. In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as: ```java Platform.putLong(buffer, /* offset */ 24, /* value */ 1); ``` This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field. But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption. One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`. Bad: ```scala val baseOffset = Platform.BYTE_ARRAY_OFFSET // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code. 
Good: ```scala val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors. Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness. NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic. ## How was this patch tested? Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job. Authored-by: Kris Mok <kris.mok@databricks.com> Closes #24031 from gatorsmile/cherrypickSPARK-27097. Lead-authored-by: Kris Mok <kris.mok@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-09 01:20:32 +00:00
Wenchen Fan	cb20fbc43e	[SPARK-27065][CORE] avoid more than one active task set managers for a stage ## What changes were proposed in this pull request? This is another attempt to fix the more-than-one-active-task-set-managers bug. https://github.com/apache/spark/pull/17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it(see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM of a stage and fail. This fix has a hole: Let's say a stage has 10 partitions and 2 task set managers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks he is still active because he hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause more than one active TSM error. https://github.com/apache/spark/pull/21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so he can mark himself as zombie after partitions 1-9 are completed. However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, and leads to the more than one active TSM error. #22806 and #23871 are created to fix this hole. However the fix is complicated and there are still ongoing discussions. 
This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager. After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250 ). #22806 and #23871 are its followups to fix the hole. ## How was this patch tested? existing tests. Closes #23927 from cloud-fan/scheduler. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-06 12:00:33 -06:00
wuyi	e5c61436a5	[SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? This is an optional solution for #22806 . #21131 firstly implement that a previous successful completed task from zombie TaskSetManager could also succeed the active TaskSetManager, which based on an assumption that an active TaskSetManager always exists for that stage when this happen. But that's not always true as an active TaskSetManager may haven't been created when a previous task succeed, and this is the reason why #22806 hit the issue. This pr extends #21131 's behavior by adding `stageIdToFinishedPartitions` into TaskSchedulerImpl, which recording the finished partition whenever a task(from zombie or active) succeed. Thus, a later created active TaskSetManager could also learn about the finished partition by looking into `stageIdToFinishedPartitions ` and won't launch any duplicate tasks. ## How was this patch tested? Add. Closes #23871 from Ngone51/dev-23433-25250. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: Ngone51 <ngone_5451@163.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-06 11:53:07 -06:00
moqimoqidea	3fcbc7fb9f	[MINOR] Spelling mistake: forword -> forward ## What changes were proposed in this pull request? Spelling mistake: forword -> forward ## How was this patch tested? This is a private function, there is no place to call this function outside of this file. Closes #23978 from moqimoqidea/master. Authored-by: moqimoqidea <39821951+moqimoqidea@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-06 16:29:07 +09:00
“attilapiros”	5668c42edf	[SPARK-27021][CORE] Cleanup of Netty event loop group for shuffle chunk fetch requests ## What changes were proposed in this pull request? Creating a Netty `EventLoopGroup` leads to creating a new Thread pool for handling the events. For stopping the threads of the pool the event loop group should be shut down which is properly done for transport servers and clients by calling for example the `shutdownGracefully()` method (for details see the `close()` method of `TransportClientFactory` and `TransportServer`). But there is a separate event loop group for shuffle chunk fetch requests which is in pipeline for handling fetch request (shared between the client and server) and owned by the `TransportContext` and this was never shut down. ## How was this patch tested? With existing unittest. This leak is in the production system too but its effect is spiking in the unittest. Checking the core unittest logs before the PR: ``` $ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l 381 ``` And after the PR without whitelisting in thread audit and with an extra `await` after the `chunkFetchWorkers.shutdownGracefully()`: ``` $ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l 0 ``` Closes #23930 from attilapiros/SPARK-27021. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-05 12:31:06 -08:00
Luca Canali	25d2850665	[SPARK-26928][CORE] Add driver CPU Time to the metrics system ## What changes were proposed in this pull request? This proposes to add instrumentation for the driver's JVM CPU time via the Spark Dropwizard/Codahale metrics system. It follows directly from previous work SPARK-25228 and shares similar motivations: it is intended as an improvement to be used for Spark performance dashboards and monitoring tools/instrumentation. Implementation details: this PR takes the code introduced in SPARK-25228 and moves it to a new separate Source JVMCPUSource, which is then used to register the jvmCpuTime gauge metric for both executor and driver. The registration of the jvmCpuTime metric for the driver is conditional, a new configuration parameter `spark.metrics.cpu.time.driver.enabled` (proposed default: false) is introduced for this purpose. ## How was this patch tested? Manually tested, using local mode and using YARN. Closes #23838 from LucaCanali/addCPUTimeMetricDriver. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-05 10:47:39 -08:00
Ajith	6207360b00	[SPARK-27012][CORE] Storage tab shows rdd details even after executor ended ## What changes were proposed in this pull request? After we cache a table, we can see its details in Storage Tab of spark UI. If the executor has shutdown ( graceful shutdown/ Dynamic executor scenario) UI still shows the rdd as cached and when we click the link it throws error. This is because on executor remove event, we fail to adjust rdd partition details org.apache.spark.status.AppStatusListener#onExecutorRemoved ## How was this patch tested? Have tested this fix in UI manually Edit: Added UT Closes #23920 from ajithme/cachestorage. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-05 10:40:38 -08:00
Yanbo Liang	7857c6d633	[SPARK-27051][CORE] Bump Jackson version to 2.9.8 ## What changes were proposed in this pull request? Fasterxml Jackson versions before 2.9.8 are affected by multiple [CVEs](https://github.com/FasterXML/jackson-databind/issues/2186), so we need to bump the Jackson dependency to 2.9.8. ## How was this patch tested? Existing tests and offline benchmark. I have run ```SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"``` to check there is no performance degradation for this upgrade. Closes #23965 from yanboliang/SPARK-27051. Authored-by: Yanbo Liang <ybliang8@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-05 11:46:51 +09:00
Sean Owen	0deebd3820	[SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 ## What changes were proposed in this pull request? Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 as they use Hadoop's implementation underneath. I think these are all the places that this needs a mention in the user-facing docs. ## How was this patch tested? Doc tests. Closes #23962 from srowen/SPARK-26016. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-05 08:03:39 +09:00
Yuming Wang	827d371877	[SPARK-25689][FOLLOW-UP][CORE] Get proxy user's delegation tokens ## What changes were proposed in this pull request? This pr makes it get proxy user's delegation token, otherwise throws `AccessControlException`([full log](https://issues.apache.org/jira/browse/SPARK-25689?focusedCommentId=16780609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16780609)): ```java org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] ... at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185) ``` How to reproduce this issue: ```shell $ ssh user_admspark-getaway-host1 $ export HADOOP_PROXY_USER=user_a $ spark-sql --master yarn ``` ## How was this patch tested? Test on our production environment. Closes #23922 from wangyum/SPARK-25689. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-04 13:21:24 -08:00
LantaoJin	e5c502c596	[SPARK-25865][CORE] Add GC information to ExecutorMetrics ## What changes were proposed in this pull request? Memory usage alone, without GC information, does not help us determine the proper memory settings. We need GC metrics about the frequency of major and minor GC. For example, take two cases where the configured executor memory is 10GB in both and the usage is near 10GB in both. Should we increase or decrease the configured memory for them? These metrics may be helpful. We can increase the configured memory for the first one if it has very frequent major GC, and decrease it for the second one if it has only some minor GC and no major GC. GC metrics are only useful over the entire lifetime of executors rather than per stage. ## How was this patch tested? Adding UT. Closes #22874 from LantaoJin/SPARK-25865. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-04 14:26:02 -06:00
Jungtaek Lim (HeartSaVioR)	d5bda2c9e8	[SPARK-26792][CORE] Apply custom log URL to Spark UI ## What changes were proposed in this pull request? [SPARK-23155](https://issues.apache.org/jira/browse/SPARK-23155) enables SHS to set up custom executor log URLs. This patch proposes to extend this feature to Spark UI as well. Unlike the approach we did for SHS (replace executor log URLs when executor information is requested so it's like a change of view), here this patch replaces executor log URLs while registering executor, which also affects event log as well. From the SHS's point of view, it will be treated as the original log URL when a custom log URL is applied to Spark UI. ## How was this patch tested? Added UT. Closes #23790 from HeartSaVioR/SPARK-26792. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-04 10:36:04 -08:00
manuzhang	81dd21fda9	[SPARK-26977][CORE] Fix warn against subclassing scala.App ## What changes were proposed in this pull request? Fix warn against subclassing scala.App ## How was this patch tested? Manual test Closes #23903 from manuzhang/fix_submit_warning. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 17:37:58 -06:00
SongYadong	86b25c4350	[SPARK-26967][CORE] Put MetricsSystem instance names together for clearer management ## What changes were proposed in this pull request? `MetricsSystem` instance creations are scattered across the project code, and so are their names. This can make browsing and management inconvenient. This PR puts them together. In this way, we have a uniform location for adding or removing them, and an overall view of the `MetricsSystem` instances in the current project. It also helps with maintaining user documents by making it harder to miss something. ## How was this patch tested? Existing unit tests. Closes #23869 from SongYadong/metrics_system_inst_manage. Lead-authored-by: SongYadong <song.yadong1@zte.com.cn> Co-authored-by: walter2001 <ydsong2007@163.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 11:49:43 -06:00
liuxian	02bbe977ab	[MINOR] Remove unnecessary gets when getting a value from map. ## What changes were proposed in this pull request? Redundant `get` when getting a value from `Map` given a key. ## How was this patch tested? N/A Closes #23901 from 10110346/removegetfrommap. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 11:48:07 -06:00
Yifei Huang	bc7592ba11	[SPARK-27009][TEST] Add Standard Deviation to benchmark results ## What changes were proposed in this pull request? Add standard deviation to the stats taken during benchmark testing. ## How was this patch tested? Manually ran a few benchmark tests locally and visually inspected the output Closes #23914 from yifeih/spark-27009-stdev. Authored-by: Yifei Huang <yifeih@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-02-28 20:55:55 -08:00
Imran Rashid	c8e7eb1fa7	[SPARK-26774][CORE] Update some docs on TaskSchedulerImpl. A couple of places in TaskSchedulerImpl could use a minor doc update on threading concerns. There is one bug fix here, but only in sc.killTaskAttempt() which is probably not used much. Closes #23874 from squito/SPARK-26774. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-28 11:30:20 -08:00
Hyukjin Kwon	6e31ccf2a1	[SPARK-26895][CORE][FOLLOW-UP] Uninitializing log after `prepareSubmitEnvironment` in SparkSubmit ## What changes were proposed in this pull request? Currently, if I run `spark-shell` in my local, it started to show the logs as below: ``` $ ./bin/spark-shell ... 19/02/28 04:42:43 INFO SecurityManager: Changing view acls to: hkwon 19/02/28 04:42:43 INFO SecurityManager: Changing modify acls to: hkwon 19/02/28 04:42:43 INFO SecurityManager: Changing view acls groups to: 19/02/28 04:42:43 INFO SecurityManager: Changing modify acls groups to: 19/02/28 04:42:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hkwon); groups with view permissions: Set(); users with modify permissions: Set(hkwon); groups with modify permissions: Set() 19/02/28 04:42:43 INFO SignalUtils: Registered signal handler for INT 19/02/28 04:42:48 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT 19/02/28 04:42:48 INFO SparkContext: Submitted application: Spark shell 19/02/28 04:42:48 INFO SecurityManager: Changing view acls to: hkwon ``` Seems to be the cause is https://github.com/apache/spark/pull/23806 and `prepareSubmitEnvironment` looks actually reinitializing the logging again. This PR proposes to uninitializing log later after `prepareSubmitEnvironment`. ## How was this patch tested? Manually tested. Closes #23911 from HyukjinKwon/SPARK-26895. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 17:01:30 -08:00
Marcelo Vanzin	a6ddc9d083	[SPARK-24736][K8S] Let spark-submit handle dependency resolution. Before this change, there was some code in the k8s backend to deal with how to resolve dependencies and make them available to the Spark application. It turns out that none of that code is necessary, since spark-submit already handles all that for applications started in client mode - like the k8s driver that is run inside a Spark-created pod. For that reason, specifically for pyspark, there's no need for the k8s backend to deal with PYTHONPATH; or, in general, to change the URIs provided by the user at all. spark-submit takes care of that. For testing, I created a pyspark script that depends on another module that is shipped with --py-files. Then I used: - --py-files http://.../dep.py http://.../test.py - --py-files http://.../dep.zip http://.../test.py - --py-files local:/.../dep.py local:/.../test.py - --py-files local:/.../dep.zip local:/.../test.py Without this change, all of the above commands fail. With the change, the driver is able to see the dependencies in all the above cases; but executors don't see the dependencies in the last two. That's a bug in shared Spark code that deals with local: dependencies in pyspark (SPARK-26934). I also tested a Scala app using the main jar from an http server. Closes #23793 from vanzin/SPARK-24736. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 09:49:31 -08:00
liuxian	7912dbb88f	[MINOR] Simplify boolean expression ## What changes were proposed in this pull request? Comparing whether Boolean expression is equal to true is redundant For example: The datatype of `a` is boolean. Before: if (a == true) After: if (a) ## How was this patch tested? N/A Closes #23884 from 10110346/simplifyboolean. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 08:38:00 -06:00
Jungtaek Lim (HeartSaVioR)	c17150a5f5	[SPARK-22860][CORE][YARN] Redact command line arguments for running Driver and Executor before logging (standalone and YARN) ## What changes were proposed in this pull request? This patch applies redaction to command line arguments before logging them. This applies to two resource managers: standalone cluster and YARN. This patch only concerns about arguments starting with `-D` since Spark is likely passing the Spark configuration to command line arguments as `-Dspark.blabla=blabla`. More change is necessary if we also want to handle the case of `--conf spark.blabla=blabla`. ## How was this patch tested? Added UT for redact logic. This patch only touches how to log so not easy to add UT regarding it. Closes #23820 from HeartSaVioR/MINOR-redact-command-line-args-for-running-driver-executor. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-26 14:49:46 -08:00
Maxim Gekk	a2a41b7bf2	[SPARK-26978][CORE][SQL] Avoid magic time constants ## What changes were proposed in this pull request? In the PR, I propose to refactor existing code related to date/time conversions, and replace constants like `1000` and `1000000` by `DateTimeUtils` constants and transformation functions from `java.util.concurrent.TimeUnit._`. ## How was this patch tested? The changes are tested by existing test suites. Closes #23878 from MaxGekk/magic-time-constants. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-26 09:08:12 -06:00
Marcelo Vanzin	4808393449	[SPARK-26788][YARN] Remove SchedulerExtensionService. Since the yarn module is actually private to Spark, this interface was never actually "public". Since it has no use inside of Spark, let's avoid adding a yarn-specific extension that isn't public, and point any potential users at more general solutions (like using a SparkListener). Closes #23839 from vanzin/SPARK-26788. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-25 13:57:37 -06:00
“attilapiros”	0ac516bebd	[SPARK-25035][CORE] Avoiding memory mapping at disk-stored blocks replication Before this PR the method `BlockManager#putBlockDataAsStream()` (which is used during block replication where the block data is received as a stream) was reading the whole block content into the memory even at DISK_ONLY storage level. With this change the received block data (which was temporary stored in a file) is just simply moved into the right location backing the target block. This way a possible OOM error is avoided. In this implementation to save code duplications the method `doPutBytes` is refactored into a template method called `BlockStoreUpdater` which has a separate implementation to handle byte buffer based and temporary file based block store updates. With existing unit tests of `DistributedSuite` (the ones dealing with replications): - caching on disk, replicated (encryption = off) (with replication as stream) - caching on disk, replicated (encryption = on) (with replication as stream) - caching in memory, serialized, replicated (encryption = on) (with replication as stream) - caching in memory, serialized, replicated (encryption = off) (with replication as stream) - etc. And with new unit tests testing `putBlockDataAsStream` method directly: - test putBlockDataAsStream with caching (encryption = off) - test putBlockDataAsStream with caching (encryption = on) - test putBlockDataAsStream with caching on disk (encryption = off) - test putBlockDataAsStream with caching on disk (encryption = on) Closes #23688 from attilapiros/SPARK-25035. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-25 11:43:39 -08:00
Maxim Gekk	2d2fb34b93	[SPARK-26953][CORE][TEST] Test TimSort for ArrayIndexOutOfBoundsException ## What changes were proposed in this pull request? In the PR, I propose to test the input showed at the end of the article: https://arxiv.org/pdf/1805.08612.pdf . The difference of the test and paper's test is type of array. This test allocates arrays of bytes instead of array of ints. ## How was this patch tested? New test is added to `SorterSuite`. Closes #23856 from MaxGekk/timsort-bug-fix. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-24 17:37:32 -06:00
seancxmao	a07b07fd85	[MINOR][DOCS] Remove references to Shark ## What changes were proposed in this pull request? This PR aims to remove references to "Shark", which is a precursor to Spark SQL. I searched the whole project for the text "Shark" (ignore case) and just found a single match. Note that occurrences like nickname or test data are irrelevant. ## How was this patch tested? N/A. Change comments only. Closes #23876 from seancxmao/remove-Shark. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-23 11:03:05 -06:00
Sean Owen	ab4e83aca7	[SPARK-26963][MLLIB] SizeEstimator can't make some JDK fields accessible in Java 9+ ## What changes were proposed in this pull request? Don't use inaccessible fields in SizeEstimator, which comes up in Java 9+ ## How was this patch tested? Manually ran tests with Java 11; it causes these tests that failed before to pass. This ought to pass on Java 8 as there's effectively no change for Java 8. Closes #23866 from srowen/SPARK-26963. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-23 11:01:47 -06:00
seancxmao	ce3a157f00	[SPARK-26939][CORE][DOC] Fix some outdated comments about task schedulers ## What changes were proposed in this pull request? This PR aims to fix some outdated comments about task schedulers. 1. Change "ClusterScheduler" to "YarnScheduler" in comments of `YarnClusterScheduler` According to [SPARK-1140 Remove references to ClusterScheduler](https://issues.apache.org/jira/browse/SPARK-1140), ClusterScheduler is not used anymore. I also searched "ClusterScheduler" within the whole project, no other occurrences are found in comments or test cases. Note classes like `YarnClusterSchedulerBackend` or `MesosClusterScheduler` are not relevant. 2. Update comments about `statusUpdate` from `TaskSetManager` `statusUpdate` has been moved to `TaskSchedulerImpl`. StatusUpdate event handling is delegated to `handleSuccessfulTask`/`handleFailedTask`. ## How was this patch tested? N/A. Fix comments only. Closes #23844 from seancxmao/taskscheduler-comments. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-23 10:52:53 -06:00
Alessandro Bellina	79a650494f	[SPARK-26895][CORE] prepareSubmitEnvironment should be called within doAs for proxy users ## What changes were proposed in this pull request? `prepareSubmitEnvironment` performs globbing that will fail in the case where a proxy user (`--proxy-user`) doesn't have permission to the file. This is a bug also with 2.3, so we should backport, as currently you can't launch an application that for instance is passing a file under `--archives`, and that file is owned by the target user. The solution is to call `prepareSubmitEnvironment` within a doAs context if proxying. ## How was this patch tested? Manual tests running with `--proxy-user` and `--archives`, before and after, showing that the globbing is successful when the resource is owned by the target user. I've looked at writing unit tests, but I am not sure I can do that cleanly (perhaps with a custom FileSystem). Open to ideas. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23806 from abellina/SPARK-26895_prepareSubmitEnvironment_from_doAs. Lead-authored-by: Alessandro Bellina <abellina@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-22 11:15:20 -08:00
Maxim Gekk	1304974539	[SPARK-26955][CORE] Align Spark's TimSort to jdk11 implementation ## What changes were proposed in this pull request? Spark's TimSort deviates from JDK 11 TimSort in a couple places: - `stackLen` was increased in jdk - additional cases for break in `mergeCollapse`: `n < 0` In the PR, I propose to align Spark TimSort to jdk implementation. ## How was this patch tested? By existing test suites, in particular, `SorterSuite`. Closes #23858 from MaxGekk/timsort-java-alignment. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-21 22:18:23 -06:00
liupengcheng	2153b316bd	[SPARK-26892][CORE] Fix saveAsTextFile throws NullPointerException when null row present ## What changes were proposed in this pull request? Currently, RDD.saveAsTextFile may throw a NullPointerException when a null row is present. ``` scala> sc.parallelize(Seq(1,null),1).saveAsTextFile("/tmp/foobar.dat") 19/02/15 21:39:17 ERROR Utils: Aborting task java.lang.NullPointerException at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$3(RDD.scala:1510) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:129) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1352) at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:127) at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1318) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` This PR writes "Null" for a null row to avoid the NPE. ## How was this patch tested? NA Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23799 from liupc/Fix-saveAsTextFile-throws-NullPointerException-when-null-row-present. Lead-authored-by: liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-20 16:42:55 -06:00
Hyukjin Kwon	3c15d8b71c	[SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame ## What changes were proposed in this pull request? This PR aims to support Arrow optimization for conversion from Spark DataFrame to R DataFrame. As on the PySpark side, it falls back to the non-optimized code path when Arrow optimization cannot be used. This can be tested as below: ```bash $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` ```r collect(createDataFrame(mtcars)) ``` ### Requirements - R 3.5.x - Arrow package 0.12+ ```bash Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")' ``` Note: currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204. Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204. ### Benchmarks Shell ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g ``` ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g ``` R code ```r df <- cache(createDataFrame(read.csv("500000.csv"))) count(df) test <- function() { options(digits.secs = 6) # milliseconds start.time <- Sys.time() collect(df) end.time <- Sys.time() time.taken <- end.time - start.time print(time.taken) } test() ``` Data (350 MB): ```r object.size(read.csv("500000.csv")) 350379504 bytes ``` "500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/ Results ``` Time difference of 221.32014 secs ``` ``` Time difference of 15.51145 secs ``` The performance improvement was around 1426%. ### Limitations: - For now, Arrow optimization with R does not support `raw` data or a schema in which the user explicitly gives the float type; both produce corrupt values. In these cases, we fall back to the non-optimized code path. - Due to ARROW-4512, it cannot send and receive batch by batch. 
It has to send all batches in Arrow stream format at once. It needs improvement later. ## How was this patch tested? Existing tests related with Arrow optimization cover this change. Also, manually tested. Closes #23760 from HyukjinKwon/SPARK-26762. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-20 11:35:17 +08:00
Gabor Somogyi	28ced387b9	[SPARK-26772][YARN] Delete ServiceCredentialProvider and make HadoopDelegationTokenProvider a developer API ## What changes were proposed in this pull request? `HadoopDelegationTokenProvider` has basically the same functionality as `ServiceCredentialProvider`, so the interfaces can be merged. `YARNHadoopDelegationTokenManager` now loads `ServiceCredentialProvider`s in one step. The drawback of this is that if one provider fails, none of the others are loaded. `HadoopDelegationTokenManager` loads `HadoopDelegationTokenProvider`s independently, so it provides more robust behaviour. In this PR I've made the following changes: * Deleted `YARNHadoopDelegationTokenManager` and `ServiceCredentialProvider` * Made `HadoopDelegationTokenProvider` a `DeveloperApi` ## How was this patch tested? Existing unit tests. Closes #23686 from gaborgsomogyi/SPARK-26772. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-15 14:43:13 -08:00


7129 commits