Commit graph

1255 commits

Author SHA1 Message Date
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Lukas Rytz 2e7583799e [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs, when building with Scala 2.13, have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the POM.

### What changes were proposed in this pull request?

This PR works around the issue by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 with the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested using the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 1a62e6a2c1)
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:58 -05:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
HyukjinKwon 8a1e172b51 [SPARK-34520][CORE] Remove unused SecurityManager references
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945.
Many of the references in `SecurityManager` were introduced in SPARK-1189, and the related usages were later removed in https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945. This PR proposes to remove them.

### Why are the changes needed?

For better code readability.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually compiled. GitHub Actions and the Jenkins build should test it out as well.

Closes #31636 from HyukjinKwon/SPARK-34520.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 20:38:03 -08:00
yangjie01 777d51e7e3 [SPARK-34374][SQL][DSTREAM] Use standard methods to extract keys or values from a Map
### What changes were proposed in this pull request?
Use the standard methods to extract `keys` or `values` from a `Map`; this is semantically consistent and uses the `DefaultKeySet` and `DefaultValuesIterable` views instead of a manual loop.

**Before**
```
map.map(_._1)
map.map(_._2)
```

**After**
```
map.keys
map.values
```
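
As a standalone illustration of the difference (a sketch, not code from this PR):

```
object KeysValuesExample extends App {
  val m = Map("a" -> 1, "b" -> 2)
  // map(_._1) traverses the entries and builds a fresh collection of keys
  val viaMap: Iterable[String] = m.map(_._1)
  // keys returns the map's key set directly (a DefaultKeySet view)
  val viaKeys: Iterable[String] = m.keys
  assert(viaMap.toSet == viaKeys.toSet)
}
```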

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31484 from LuciferYang/keys-and-values.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 15:42:55 -06:00
yangjie01 2e192b6f45 [SPARK-34284][CORE][TESTS] Fix deprecated API usage of Apache commons-io
### What changes were proposed in this pull request?
There are some compilation warnings about deprecated Apache commons-io API usage, as follows:

 ```
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1109: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_left | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1110: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_right | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1152: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1167: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:201: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.<local HistoryServerSuite>.$anonfun.exp | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:716: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.getContentAndCode.inString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:732: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.connectAndGetInputStream.errString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:267: [deprecation  org.apache.spark.streaming.InputStreamsSuite.<local InputStreamsSuite>.$anonfun.$anonfun.write | origin=org.apache.commons.io.IOUtils.write | version=] method write in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala:912: [deprecation  org.apache.spark.streaming.StreamingContextSuite.createCorruptedCheckpoint | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
```

The main API change is that these methods now need a `java.nio.charset.Charset` parameter, so the main change of this PR is to add a `StandardCharsets.UTF_8` argument to these method calls.
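
For illustration, a minimal sketch of the pattern applied throughout this PR (the file path is hypothetical):

```
import java.io.File
import java.nio.charset.StandardCharsets

import org.apache.commons.io.FileUtils

object CommonsIoCharsetExample extends App {
  val f = new File("/tmp/example.txt") // hypothetical path
  // Deprecated: FileUtils.write(f, "hello") relied on the platform charset
  FileUtils.write(f, "hello", StandardCharsets.UTF_8)
  // Deprecated: FileUtils.readFileToString(f) relied on the platform charset
  println(FileUtils.readFileToString(f, StandardCharsets.UTF_8))
}
```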

### Why are the changes needed?
Fix deprecated API usage of Apache commons-io.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31389 from LuciferYang/SPARK-34284.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 17:50:14 +09:00
yangjie01 8999e8805d [SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by Source.fromXXX are closed
### What changes were proposed in this pull request?
Calling a method like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` leaks the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap each `BufferedSource` and ensure it is closed.
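
A minimal sketch of the pattern (this standalone `tryWithResource` mirrors what Spark's `Utils.tryWithResource` does; the file name is hypothetical):

```
import java.io.Closeable
import scala.io.Source

object SourceLeakExample extends App {
  // Standalone stand-in for Utils.tryWithResource
  def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
    val resource = createResource
    try f(resource) finally resource.close()
  }

  // Before: Source.fromFile("conf.txt").mkString leaks the file handle.
  // After: the BufferedSource is closed even if reading throws.
  val content = tryWithResource(Source.fromFile("conf.txt"))(_.mkString)
  println(content)
}
```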

### Why are the changes needed?
Avoid file handle leaks.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31323 from LuciferYang/source-not-closed.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 19:06:37 +09:00
yangjie01 9e33d49b5b [SPARK-33346][CORE][SQL][MLLIB][DSTREAM][K8S] Change the never changed 'var' to 'val'
### What changes were proposed in this pull request?
Some local variables are declared as `var` but are never reassigned, so they should be declared as `val`. This PR turns these from `var` to `val`, except for `mockito`-related cases.
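
A trivial, hypothetical illustration of the category of change:

```
object VarToValExample extends App {
  // Before: declared as a var but never reassigned
  // var total = (1 to 10).sum
  // After: a val documents the intent and the compiler rejects reassignment
  val total = (1 to 10).sum
  println(total)
}
```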

### Why are the changes needed?
Use `val` instead of `var` when possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31142 from LuciferYang/SPARK-33346.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-15 08:47:02 -06:00
HyukjinKwon 3d0323401f [SPARK-33810][TESTS] Reenable test cases disabled in SPARK-31732
### What changes were proposed in this pull request?

The test failures were due to slow machines in Jenkins. We switched to Ubuntu 20, if I am not wrong.
All machines now look to be functioning properly, unlike in the past, and the tests pass without problems.

This PR proposes to enable them back.

### Why are the changes needed?

To restore test coverage.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins jobs in this PR show the flakiness.

Closes #30798 from HyukjinKwon/do-not-merge-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-16 08:34:22 -08:00
Dongjoon Hyun de9818f043 [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
Dongjoon Hyun 3ce4ab545b [SPARK-33513][BUILD] Upgrade to Scala 2.13.4 to improve exhaustivity
### What changes were proposed in this pull request?

This PR aims at the following:
1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1
2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.)
3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job.

### Why are the changes needed?

Scala 2.13.4 is a maintenance release for the 2.13 line and improves JDK 15 support.
- https://github.com/scala/scala/releases/tag/v2.13.4

Also, it improves exhaustivity check.
- https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors)
- https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components)

### Does this PR introduce _any_ user-facing change?

Yes. Although it's only a maintenance version change, it is still a Scala version change.

### How was this patch tested?

Pass the CIs and do the manual testing.
- Scala 2.12 CI jobs (GitHub Action / Jenkins UT / Jenkins K8s IT) to check the validity of the code change.
- Scala 2.13 Compilation job to check the compilation

Closes #30455 from dongjoon-hyun/SCALA_3.13.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-23 16:28:43 -08:00
yangjie01 e3058ba17c [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports
### What changes were proposed in this pull request?
This PR adds new Scala compiler args to `pom.xml` to defend against new unused imports:

- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13

The other file changes remove all unused imports in the Spark code.
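
The PR itself edits `pom.xml`; as a hedged sketch, the equivalent in an sbt build definition (which is itself Scala) would be:

```
// build.sbt sketch (an illustration only; this PR edits pom.xml, not sbt)
scalacOptions ++= {
  if (scalaBinaryVersion.value == "2.13") Seq("-Wconf:cat=unused-imports:e")
  else Seq("-Ywarn-unused-import")
}
```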

### Why are the changes needed?
Clean up code and guard against new unused imports.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30351 from LuciferYang/remove-imports-core-module.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-19 14:20:39 +09:00
yi.wu edeecada66 [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission
### What changes were proposed in this pull request?

This PR cleans up the RPC message flow among the multiple decommission use cases, it includes changes:

* Keep the `Worker`'s decommission status consistent between the case where decommission starts from the `Worker` and the case where it starts from the `MasterWebUI`: send `DecommissionWorker` from `Master` to `Worker` in the latter case.

* Change from two-way to one-way communication when notifying decommission between driver and executor: it is unnecessary for the executor to acknowledge the decommission status to the driver, since the decommission request comes from the driver. The same holds in reverse.

* Send only one message instead of two (`DecommissionSelf`/`DecommissionBlockManager`) when decommissioning the executor: the executor and `BlockManager` are in the same JVM.

* Clean up codes around here.

### Why are the changes needed?

Before:

<img width="1948" alt="WeChat56c00cc34d9785a67a544dca036d49da" src="https://user-images.githubusercontent.com/16397174/92850308-dc461c80-f41e-11ea-8ac0-287825f4e0c4.png">

After:
<img width="1968" alt="WeChat05f7afb017e3f0132394c5e54245e49e" src="https://user-images.githubusercontent.com/16397174/93189571-de88dd80-f774-11ea-9300-1943920aa27d.png">

(Note that the diagrams count only those RPC calls that need to go through the network; local RPC calls are not counted here.)

After this change, we removed 6 of the original RPC calls and added one more RPC call to keep the Worker's decommission status consistent. The RPC flow becomes clearer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #29817 from Ngone51/simplify-decommission-rpc.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-23 13:58:44 +09:00
yangjie01 dd80845735 [SPARK-32964][DSTREAMS] Pass all streaming module UTs in Scala 2.13
### What changes were proposed in this pull request?

There is only one failing case in the `streaming` module in Scala 2.13: `start with non-serializable DStream checkpoint ` in `StreamingContextSuite`.

`StackOverflowError` is thrown here when `SerializationDebugger#visit` method is called.

I found that `inputStreams` and `outputStreams` in `DStreamGraph` cannot be matched in the `SerializationDebugger#visit` method because an `ArrayBuffer` is not an `Array` in Scala 2.13.

The main change of this PR is to use `mutable.ArraySeq` instead of `ArrayBuffer` to store `inputStreams` and `outputStreams` in `DStreamGraph`, so they can be matched in the `SerializationDebugger#visit` method.
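
A hypothetical sketch of why the concrete type matters to a visitor like `SerializationDebugger#visit` (names and match arms are illustrative, not the actual visitor):

```
import scala.collection.mutable

object ArrayMatchExample extends App {
  // Only the Array[_] arm recurses into elements; an ArrayBuffer falls
  // through to generic object traversal.
  def visit(obj: Any): String = obj match {
    case a: Array[_] => s"recurse into ${a.length} elements"
    case other       => s"generic traversal of ${other.getClass.getSimpleName}"
  }

  println(visit(Array(1, 2, 3)))               // recurse into 3 elements
  println(visit(mutable.ArrayBuffer(1, 2, 3))) // generic traversal of ArrayBuffer
}
```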

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Pass GitHub 2.13 Build Action

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl streaming -Pscala-2.13 -am
mvn test -pl streaming -Pscala-2.13
mvn test -pl core -Pscala-2.13
```

streaming module:

```
Tests: succeeded 339, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```

core module:

```
Tests: succeeded 2648, failed 0, canceled 4, ignored 7, pending 0
All tests passed.
```

Closes #29836 from LuciferYang/fix-streaming-213.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-22 11:01:44 -07:00
Wenchen Fan 0c66813ad9 Revert "[SPARK-32850][CORE] Simplify the RPC message flow of decommission"
This reverts commit 56ae95053d.
2020-09-21 13:28:31 +08:00
William Hyun 7892887981 [SPARK-32930][CORE] Replace deprecated isFile/isDirectory methods
### What changes were proposed in this pull request?

This PR aims to replace deprecated `isFile` and `isDirectory` methods.

```diff
- fs.isDirectory(hadoopPath)
+ fs.getFileStatus(hadoopPath).isDirectory
```

```diff
- fs.isFile(new Path(inProgressLog))
+ fs.getFileStatus(new Path(inProgressLog)).isFile
```

### Why are the changes needed?

It shows deprecation warnings.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull

```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn]             if (!fs.isFile(new Path(inProgressLog))) {
```

```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn]           if (fs.isDirectory(hadoopPath)) {
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #29796 from williamhyun/filesystem.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-18 18:13:11 +09:00
yi.wu 56ae95053d [SPARK-32850][CORE] Simplify the RPC message flow of decommission
### What changes were proposed in this pull request?

This PR cleans up the RPC message flow among the multiple decommission use cases, it includes changes:

* Keep the `Worker`'s decommission status consistent between the case where decommission starts from the `Worker` and the case where it starts from the `MasterWebUI`: send `DecommissionWorker` from `Master` to `Worker` in the latter case.

* Change from two-way to one-way communication when notifying decommission between driver and executor: it is unnecessary for the executor to acknowledge the decommission status to the driver, since the decommission request comes from the driver. The same holds in reverse.

* Send only one message instead of two (`DecommissionSelf`/`DecommissionBlockManager`) when decommissioning the executor: the executor and `BlockManager` are in the same JVM.

* Clean up codes around here.

### Why are the changes needed?

Before:

<img width="1948" alt="WeChat56c00cc34d9785a67a544dca036d49da" src="https://user-images.githubusercontent.com/16397174/92850308-dc461c80-f41e-11ea-8ac0-287825f4e0c4.png">

After:
<img width="1968" alt="WeChat05f7afb017e3f0132394c5e54245e49e" src="https://user-images.githubusercontent.com/16397174/93189571-de88dd80-f774-11ea-9300-1943920aa27d.png">

(Note that the diagrams count only those RPC calls that need to go through the network; local RPC calls are not counted here.)

After this change, we removed 6 of the original RPC calls and added one more RPC call to keep the Worker's decommission status consistent. The RPC flow becomes clearer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #29722 from Ngone51/simplify-decommission-rpc.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-16 15:00:31 +00:00
Kousuke Saruta b121f0d459 [SPARK-32873][BUILD] Fix code which causes error when build with sbt and Scala 2.13
### What changes were proposed in this pull request?

This PR fixes code which causes errors when building with sbt and Scala 2.13, like the following.
```
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251: method with a single empty parameter list overrides method without any parameter list
[error] [warn]   override def hasNext(): Boolean = requestOffset < part.untilOffset
[error] [warn]
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294: method with a single empty parameter list overrides method without any parameter list
[error] [warn]   override def hasNext(): Boolean = okNext
```

More specifically, this PR fixes the following (a combined sketch of the first two cases follows this list):

* Methods which have an empty parameter list but override a method which has no parameter list.
```
override def hasNext(): Boolean = okNext
```

* Methods which have no parameter list but override a method which has an empty parameter list.
```
      override def next: (Int, Double) = {
```

* Infix operator expressions where the line wraps at the operator.

Before (the continuation lines begin with the operator):
```
3L * math.min(k, numFeatures) * math.min(k, numFeatures)
+ math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
* math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```

After (the operator moves to the end of the preceding line):
```
3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```
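
A combined, hypothetical sketch of the first two cases (Scala 2.13 requires an override to match the parameter-list shape of the overridden member):

```
trait Cursor {
  def hasNext: Boolean // declared without a parameter list
  def next(): Int      // declared with an empty parameter list
}

class OkCursor extends Cursor {
  override def hasNext: Boolean = false // shape matches the trait: compiles
  override def next(): Int = 0          // shape matches the trait: compiles
}

// Rejected when building with sbt and Scala 2.13:
//   override def hasNext(): Boolean = false // adds an empty parameter list
//   override def next: Int = 0              // drops the parameter list
```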

### Why are the changes needed?

For building Spark with sbt and Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

After this change and #29742 were applied, compilation passed with the following command.
```
build/sbt -Pscala-2.13  -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```

Closes #29745 from sarutak/fix-code-for-sbt-and-spark-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-14 15:34:58 +09:00
yi.wu 125cbe3ae0 [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl
### What changes were proposed in this pull request?

The motivation of this PR is to avoid caching the removed decommissioned executors in `TaskSchedulerImpl`. The cache is introduced in https://github.com/apache/spark/pull/29422. The cache will hold the `isHostDecommissioned` info for a while. So if the task `FetchFailure` event comes after the executor loss event, `DAGScheduler` can still get the `isHostDecommissioned` from the cache and unregister the host shuffle map status when the host is decommissioned too.

This PR tries to achieve the same goal without the cache. Instead of saving the `workerLost` in `ExecutorUpdated` / `ExecutorDecommissionInfo` / `ExecutorDecommissionState`, we could save the `hostOpt` directly. When the host is decommissioned or lost too, the `hostOpt` can be a specific host address. Otherwise, it's `None` to indicate that only the executor is decommissioned or lost.
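
A hedged sketch of the data-shape change (field and class names follow this description, not necessarily the final code):

```
// Before: a boolean that had to be cached to answer "was the host lost too?"
// case class ExecutorDecommissionInfo(message: String, workerLost: Boolean)

// After: carry the host directly. None means only the executor is
// decommissioned or lost; Some(host) means the host is too.
case class ExecutorDecommissionInfo(message: String, hostOpt: Option[String] = None)
```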

Now that we have the host info, we can also unregister the host shuffle map status when `executorLost` is triggered for the decommissioned executor.

Besides, this PR also includes a few cleanups around the touched code.

### Why are the changes needed?

It helps to unregister the shuffle map status earlier for both decommission and normal executor lost cases.

It also saves memory in  `TaskSchedulerImpl` and simplifies the code a little bit.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR only refactors the code. The original behaviour should be covered by `DecommissionWorkerSuite`.

Closes #29579 from Ngone51/impr-decom.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-08 04:40:13 +00:00
yi.wu 3092527f75 [SPARK-32651][CORE] Decommission switch configuration should have the highest hierarchy
### What changes were proposed in this pull request?

Rename `spark.worker.decommission.enabled` to `spark.decommission.enabled` and move it from `org.apache.spark.internal.config.Worker` to `org.apache.spark.internal.config.package`.
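
A hedged sketch of the relocated switch in Spark's `ConfigBuilder` idiom (it would live inside `org.apache.spark.internal.config.package`; the exact builder calls and doc string here are assumptions):

```
private[spark] val DECOMMISSION_ENABLED =
  ConfigBuilder("spark.decommission.enabled")
    .doc("Whether to enable decommissioning, independent of the cluster manager.")
    .booleanConf
    .createWithDefault(false)
```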

### Why are the changes needed?

Decommission is already supported in Standalone and k8s, and may be supported in Yarn (https://github.com/apache/spark/pull/27636) in the future. Therefore, the switch configuration should have the highest hierarchy rather than belong to Standalone's Worker. In other words, it should be independent of the cluster managers.

### Does this PR introduce _any_ user-facing change?

No, as the decommission feature hasn't been released.

### How was this patch tested?

Pass existing tests.

Closes #29466 from Ngone51/fix-decom-conf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 06:53:06 +00:00
Holden Karau 548ac7c4af [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling
### What changes were proposed in this pull request?

If graceful decommissioning is enabled, Spark's dynamic scaling uses this instead of directly killing executors.

### Why are the changes needed?

When scaling down Spark we should avoid triggering recomputes as much as possible.

### Does this PR introduce _any_ user-facing change?

Hopefully their jobs run faster or at the same speed. It also enables experimental shuffle-service-free dynamic scaling when graceful decommissioning is enabled (using the same code as the shuffle-tracking dynamic scaling).

### How was this patch tested?

For now I've extended the ExecutorAllocationManagerSuite for both core & streaming.

Closes #29367 from holdenk/SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-08-12 17:07:18 -07:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though the diff looks large, 99% of the changes are due to the new locations of scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Holden Karau a4ca355af8 [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
### What is changed?

This pull request adds the ability to migrate shuffle files during Spark's decommissioning. The design document associated with this change is at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE .

To allow this change the `MapOutputTracker` has been extended to allow the location of shuffle files to be updated with `updateMapOutput`. When a shuffle block is put, a block update message will be sent which triggers the `updateMapOutput`.

Instead of rejecting remote puts of shuffle blocks, `BlockManager` delegates the storage of shuffle blocks to its shuffle manager's resolver (if supported). A new, experimental trait is added for shuffle resolvers to indicate that they handle remote puts of blocks.
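
A hedged sketch of the opt-in contract described here (the trait is experimental; the name and signature below are illustrative, not the exact API):

```
// Shuffle resolvers mix this in to signal that they accept remote puts of
// shuffle blocks during decommissioning.
trait SupportsShuffleBlockPut {
  def putShuffleBlock(blockId: String, data: Array[Byte]): Unit
}
```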

The existing block migration code is moved out into a separate file, and a producer/consumer model is introduced for migrating shuffle files from the host as quickly as possible while not overwhelming other executors.

### Why are the changes needed?

Recomputing shuffle blocks can be expensive; we should take advantage of our decommissioning time to migrate these blocks.

### Does this PR introduce any user-facing change?

This PR introduces two new configs parameters, `spark.storage.decommission.shuffleBlocks.enabled` & `spark.storage.decommission.rddBlocks.enabled` that control which blocks should be migrated during storage decommissioning.

### How was this patch tested?

New unit tests & an expansion of the Spark on K8s decom test to assert that decommissioning with shuffle block migration means the results are not recomputed even when the original executor is terminated.

This PR is a cleaned-up version of the previous WIP PR I made https://github.com/apache/spark/pull/28331 (thanks to attilapiros for his very helpful reviewing on it :)).

Closes #28708 from holdenk/SPARK-20629-copy-shuffle-data-when-nodes-are-being-shutdown-cleaned-up.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Attila Zsolt Piros <attilazsoltpiros@apiros-mbp16.lan>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-19 21:33:13 -07:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most are quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Sean Owen d6a68e0b67 [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13
### What changes were proposed in this pull request?

Continuation of https://github.com/apache/spark/pull/28971 which lets streaming, catalyst and sql compile for 2.13. Same idea.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29078 from srowen/SPARK-29292.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 02:06:50 -07:00
Holden Karau 90ac9f975b [SPARK-32004][ALL] Drop references to slave
### What changes were proposed in this pull request?

This change replaces the word slave with alternatives matching the context.

### Why are the changes needed?

There is no need to call things slave; we might as well use better, clearer names.

### Does this PR introduce _any_ user-facing change?

Yes, the output JSON does change. To allow backwards compatibility this is an additive change.
The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr.

### How was this patch tested?

Existing tests.

Closes #28864 from holdenk/SPARK-32004-drop-references-to-slave.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-13 14:05:33 -07:00
iRakson 9b098f1eb9 [SPARK-30119][WEBUI] Support pagination for streaming tab
### What changes were proposed in this pull request?
#28747 reverted #28439 due to some flaky test case. This PR fixes the flaky test and adds pagination support.

### Why are the changes needed?
To support pagination for the streaming tab.

### Does this PR introduce _any_ user-facing change?
Yes. Now the streaming tab tables will be paginated.

### How was this patch tested?
Manually.

Closes #28748 from iRakson/fixstreamingpagination.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-12 10:27:31 -05:00
Kousuke Saruta 88a4e55fae [SPARK-31765][WEBUI][TEST-MAVEN] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.

### Why are the changes needed?

Recently, a security issue affecting HtmlUnit was reported:
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is only used for tests, so the impact might not be large, but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-11 18:27:53 -05:00
Kousuke Saruta f7501ddd70 Revert "[SPARK-30119][WEBUI] Add Pagination Support to Streaming Page"
This PR reverts #28439 because that PR breaks the QA build.

Closes #28747 from sarutak/revert-SPARK-30119.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-06-08 10:10:52 +09:00
iRakson e9337f505b [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page
### What changes were proposed in this pull request?
* Pagination support is added to all tables of the streaming page in the Spark web UI.
For adding pagination support, existing classes from #7399 were used.
* Earlier, the streaming page had two tables, `Active Batches` and `Completed Batches`. Now we will have three tables: `Running Batches`, `Waiting Batches` and `Completed Batches`. With a large number of waiting and running batches, keeping track of them in a single table is difficult. Other pages also use different tables for different types of data.
* Earlier, empty tables were shown. Now only non-empty tables will be shown.
The `Active Batches` table used to show details of waiting batches followed by running batches.

### Why are the changes needed?
Pagination will allow users to analyse the tables in a much better way. All Spark web UI pages support pagination apart from the streaming page, so this adds consistency as well. It might also fix the potential OOM errors that can arise.

### Does this PR introduce _any_ user-facing change?
Yes. The `Active Batches` table is split into two tables, `Running Batches` and `Waiting Batches`. Pagination support is added to all the tables. Every other functionality is unchanged.

### How was this patch tested?
Manually.

Before changes:
<img width="1667" alt="Screenshot 2020-05-03 at 7 07 14 PM" src="https://user-images.githubusercontent.com/15366835/80915680-8fb44b80-8d71-11ea-9957-c4a3769b8b67.png">

After Changes:
<img width="1669" alt="Screenshot 2020-05-03 at 6 51 22 PM" src="https://user-images.githubusercontent.com/15366835/80915694-a9ee2980-8d71-11ea-8fc5-246413a4951d.png">

Closes #28439 from iRakson/streamingPagination.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-06-07 13:08:50 +09:00
HyukjinKwon baafd4386c Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit e5c3463910.
2020-06-03 14:15:30 +09:00
Kousuke Saruta e5c3463910 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.

### Why are the changes needed?

Recently, a security issue affecting HtmlUnit was reported:
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is only used for tests, so the impact might not be large, but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-02 08:29:07 -05:00
Gengliang Wang db5e5fce68 Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit 92877c4ef2.

Closes #28602 from gengliangwang/revertSPARK-31765.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 16:00:58 -07:00
Kousuke Saruta 92877c4ef2 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.

### Why are the changes needed?

Recently, a security issue affecting HtmlUnit was reported:
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is only used for tests, so the impact might not be large, but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 11:43:25 -07:00
Wenchen Fan 2012d58475 [SPARK-31732][TESTS] Disable some flaky tests temporarily
### What changes were proposed in this pull request?

It's quite annoying to be blocked by flaky tests in several PRs. This PR disables them. The tests come from 3 PRs I've been watching recently:
https://github.com/apache/spark/pull/28526
https://github.com/apache/spark/pull/28463
https://github.com/apache/spark/pull/28517

### Why are the changes needed?

To make the PR builder more stable.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28547 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-16 07:33:58 -07:00