Commit graph

7885 commits

Author SHA1 Message Date
Dongjoon Hyun dcaf62afea [SPARK-34346][CORE][TESTS][FOLLOWUP] Fix UT by removing core-site.xml
### What changes were proposed in this pull request?

This is a follow-up for SPARK-34346, which caused flakiness due to the addition of the `core-site.xml` test resource file. This PR aims to remove the test resource `core/src/test/resources/core-site.xml` from the `core` module.

### Why are the changes needed?

Due to the test resource `core-site.xml`, YARN UT becomes flaky in GitHub Action and Jenkins.
```
$ build/sbt "yarn/testOnly *.YarnClusterSuite -- -z SPARK-16414" -Pyarn
...
[info] YarnClusterSuite:
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED *** (20 seconds, 209 milliseconds)
[info]   FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:210)
```

To isolate this further, we could use `SPARK_TEST_HADOOP_CONF_DIR` like the `yarn` module's `yarn/Client` does, but that seems like overkill in the `core` module.
```
// SPARK-23630: during testing, Spark scripts filter out hadoop conf dirs so that user's
// environments do not interfere with tests. This allows a special env variable during
// tests so that custom conf dirs can be used by unit tests.
val confDirs = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR") ++
  (if (Utils.isTesting) Seq("SPARK_TEST_HADOOP_CONF_DIR") else Nil)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31515 from dongjoon-hyun/SPARK-34346-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 11:32:23 +09:00
Dongjoon Hyun eb5558ed35 [SPARK-34390][CORE] Enable Zstandard buffer pool by default
### What changes were proposed in this pull request?

This PR aims to enable ZStandard JNI BufferPool by default in Apache Spark 3.2.0.

### Why are the changes needed?

**1. SPEED UP**
SPARK-34387 shows the speed-up on both Java8/Java11 by adding [ZStandardBenchmark](https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/io/ZStandardBenchmark.scala).

**2. MEMORY USAGE**
The following are the memory usage graphs while running [ZStandardBenchmark](https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/io/ZStandardBenchmark.scala) on Java11 with an increased N (100000) to make the difference easy to visualize. In the charts, the first half is the memory consumption without the buffer pool, while the second half is with the buffer pool. The difference is noticeable.
```scala
-  val N = 10000
+  val N = 100000
```

- Compression
![Screenshot from 2021-02-06 18-41-17](https://user-images.githubusercontent.com/9700541/107134909-0c4cfb00-68ab-11eb-9273-82cbecdebfba.png)

- Decompression
![Screenshot from 2021-02-06 18-43-05](https://user-images.githubusercontent.com/9700541/107134927-2edf1400-68ab-11eb-97c4-5cd101e91bb0.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing UTs.

Closes #31502 from dongjoon-hyun/SPARK-34390.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 23:55:17 -08:00
Dongjoon Hyun 466c045bfa [SPARK-34387][CORE][TESTS] Add ZStandardBenchmark
### What changes were proposed in this pull request?

This PR aims to add ZStandardBenchmark as a base-line.

### Why are the changes needed?

This will prevent any regression when we upgrade Zstandard library in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

Closes #31498 from dongjoon-hyun/SPARK-ZSTD-BENCH.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 15:34:40 -08:00
Kent Yao 961c85166a [SPARK-34346][CORE][SQL] io.file.buffer.size set by spark.buffer.size can be accidentally overridden by loading hive-site.xml, which may cause a perf regression
### What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users may simply share the `hive-site.xml` from their Hive jobs and copy it to `SPARK_HOME`/conf without modification. In Spark, when we generate Hadoop configurations, we use `spark.buffer.size` (65536) to reset `io.file.buffer.size` (4096). But when we load hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to hive-site.xml.

1. The configuration priority for setting Hadoop and Hive configs here is wrong; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`

2. This breaks the `spark.buffer.size` config's behavior for tuning IO performance with HDFS if there is an existing `io.file.buffer.size` in hive-site.xml (see the sketch below)
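
A minimal sketch of the intended precedence when building the Hadoop configuration; the helper is illustrative, not Spark's actual code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hedged sketch: later set() calls win, so Spark-side settings must be
// applied after hive-site.xml to take precedence.
def newHadoopConf(sparkConf: SparkConf): Configuration = {
  val conf = new Configuration()        // hadoop defaults (io.file.buffer.size = 4096)
  conf.addResource("hive-site.xml")     // hive: may carry its own io.file.buffer.size
  sparkConf.getAll.foreach { case (k, v) =>
    // spark.hadoop.* overrides whatever hive-site.xml set
    if (k.startsWith("spark.hadoop.")) conf.set(k.stripPrefix("spark.hadoop."), v)
  }
  // spark wins last: spark.buffer.size resets io.file.buffer.size (default 65536)
  conf.set("io.file.buffer.size", sparkConf.get("spark.buffer.size", "65536"))
  conf
}
```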

### Why are the changes needed?

Bugfix for the configuration precedence; it also fixes the performance regression caused by that behavior change.

### Does this PR introduce _any_ user-facing change?

Yes; this PR restores behavior that had been silently changed from the user's perspective.

### How was this patch tested?

new tests

Closes #31460 from yaooqinn/SPARK-34346.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-05 10:13:19 +09:00
Jungtaek Lim (HeartSaVioR) fbe726f5b1 [SPARK-34339][CORE][SQL] Expose the number of total paths in Utils.buildLocationMetadata()
### What changes were proposed in this pull request?

This PR proposes to expose the number of total paths in Utils.buildLocationMetadata(), relaxing the space usage a bit (around 10+ extra chars).

Suppose only the first 2 of 5 paths fit within the threshold; the outputs before and after the change are below (a sketch follows the list):

* before the change: `[path1, path2]`
* after the change: `(5 paths)[path1, path2, ...]`
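
A minimal sketch of the new output shape; the real logic lives in `Utils.buildLocationMetadata` and truncates based on a length threshold, so the `maxShown` parameter below is illustrative:

```scala
// Hedged sketch: prefix the output with the total path count and mark truncation.
def buildLocationMetadata(paths: Seq[String], maxShown: Int): String = {
  val shown = paths.take(maxShown)
  val ellipsis = if (shown.length < paths.length) ", ..." else ""
  s"(${paths.length} paths)[${shown.mkString(", ")}$ellipsis]"
}

buildLocationMetadata(Seq("path1", "path2", "path3", "path4", "path5"), 2)
// => (5 paths)[path1, path2, ...]
```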

### Why are the changes needed?

SPARK-31793 silently truncates the paths, hence end users can't tell how many paths were truncated, or even whether any paths were truncated at all.

### Does this PR introduce _any_ user-facing change?

Yes, the location metadata will also show how many paths were truncated (not shown), instead of truncating them silently.

### How was this patch tested?

Modified UTs

Closes #31464 from HeartSaVioR/SPARK-34339.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-05 09:37:38 +09:00
Jungtaek Lim (HeartSaVioR) 44dcf0062c [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path
### What changes were proposed in this pull request?

This PR proposes to fix the UTs added in SPARK-31793, so that everything contributing to the length limit is properly accounted for.

### Why are the changes needed?

The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs explain the missing points and also exercise the fix.

Closes #31449 from HeartSaVioR/SPARK-34326-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-04 08:46:11 +09:00
Dongjoon Hyun 8e28218106 [SPARK-34340][CORE] Support ZSTD JNI BufferPool
### What changes were proposed in this pull request?

This PR aims two goals.
1. Support ZSTD JNI BufferPool feature by adding a new configuration, `spark.io.compression.zstd.bufferPool.enabled`, for Apache Spark 3.2.0.
2. Make Spark independent from ZSTD JNI library's default buffer pool policy change.

### Why are the changes needed?

ZSTD JNI library has different behaviors across its versions.

| Version | Description | Commit |
| ---------- | --------------- | ----------- |
| v1.4.5-7 | `BufferPool` was added and used by default | 4f55c89172 |
| v1.4.5-8 | `RecyclingBufferPool` was added and `BufferPool` became an interface to allow custom BufferPool implementations | dd2588edd3 |
| v1.4.7+ | `NoPool` is used by default and users should specify a buffer pool explicitly | f7c8279bc1 |
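
A minimal sketch of what the new flag selects, assuming zstd-jni's `BufferPool` API (`RecyclingBufferPool`, `NoPool`); the wiring is simplified relative to Spark's actual `ZStdCompressionCodec`:

```scala
import java.io.InputStream
import com.github.luben.zstd.{NoPool, RecyclingBufferPool, ZstdInputStreamNoFinalizer}

// Hedged sketch: spark.io.compression.zstd.bufferPool.enabled picks the pool.
def wrapForDecompression(in: InputStream, bufferPoolEnabled: Boolean): InputStream = {
  val pool = if (bufferPoolEnabled) RecyclingBufferPool.INSTANCE else NoPool.INSTANCE
  new ZstdInputStreamNoFinalizer(in, pool)
}
```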

### Does this PR introduce _any_ user-facing change?

No, the default value (`false`) is consistent with the AS-IS ZSTD-JNI library's default buffer pool.

### How was this patch tested?

Pass the CIs with the updated UT.

Closes #31453 from dongjoon-hyun/SPARK-34340.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-03 09:11:20 -08:00
Erik Krogen 791de00ddf [SPARK-34182][AVRO] Improve error messages when matching Catalyst-to-Avro schemas
### What changes were proposed in this pull request?
Improve the error messages for incompatibilities between Avro and Catalyst schemas. First, make `AvroSerializer` more similar to `AvroDeserializer` in printing out contextual information such as hierarchical field names. Standardize exception messages in both serializer and deserializer to always include such contextual information, and include a top-level exception which shows the full schemas which were being parsed when the incompatibility was found. Both now print out the hierarchical name for both the Avro and Catalyst fields, since they may be different due to case sensitivity and Avro union handling.

### Why are the changes needed?
The error messages in this type of failure scenario are very lacking in information on the write path (`AvroSerializer`). Below are two examples of messages that provide insufficient information to determine what went wrong (lacking in field names, context about the overall schema structure, etc.).
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type IntegerType to Avro type "float".

org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type {"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
```
The error messages currently existing in `AvroDeserializer` are much better, but still not very internally consistent, and it would be better if they were consistent with the newly added exception messages in `AvroSerializer`.

### Does this PR introduce _any_ user-facing change?
Error messages when there are incompatibilities between Avro and Catalyst schemas will be greatly improved when writing Avro data using the `avroSchema` option, slightly improved when reading Avro data, and much more consistent between the two.

Below is an example of a new message. See `AvroSerdeSuite` for more examples.
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type STRUCT<`foo`: STRUCT<`bar`: INT>> to Avro type {"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
	at org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
...
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible (sqlType = STRUCT<`bar`: INT>, avroType = "int")
	at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
...
```

### How was this patch tested?
New unit test suite, `AvroSerdeSuite`, was added to test corner cases on `AvroSerializer` and `AvroDeserializer` and verify that the exception messages are as expected. Existing tests in `AvroSuite` also continue to pass, with modifications in places where assertions were made about the exceptions that would be thrown.

Closes #31333 from xkrogen/xkrogen-SPARK-34182-avro-serde-errormessages.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-03 08:49:36 +00:00
HyukjinKwon e927bf90e0 Revert "[SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path"
This reverts commit 63866025d2.
2021-02-03 12:32:39 +09:00
offthewall123 60c71c6d2d [SPARK-34325][CORE] Remove unused shuffleBlockResolver variable in SortShuffleWriter
### What changes were proposed in this pull request?
Remove unused shuffleBlockResolver variable in SortShuffleWriter.

### Why are the changes needed?
For better code understanding.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
End to End.

Closes #31433 from offthewall123/remove_shuffleBlockResolver_in_SortShuffleWriter.

Authored-by: offthewall123 <dingyu.xu@intel.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-03 09:40:08 +09:00
Jungtaek Lim (HeartSaVioR) 63866025d2 [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path
### What changes were proposed in this pull request?

This PR proposes to fix the UTs added in SPARK-31793, so that everything contributing to the length limit is properly accounted for.

### Why are the changes needed?

The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UTs explain the missing points and also exercise the fix.

Closes #31435 from HeartSaVioR/SPARK-34326.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-03 07:35:22 +09:00
yangjie01 9db566a882 [SPARK-34310][CORE][SQL] Replaces map and flatten with flatMap
### What changes were proposed in this pull request?
Replaces `collection.map(f1).flatten(f2)` with `collection.flatMap` where possible. It's semantically consistent but looks simpler.
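
For example (illustrative, not taken from the PR):

```scala
val lines = Seq("a b", "c d")

lines.map(_.split(" ")).flatten  // before: two combinators, intermediate collection
lines.flatMap(_.split(" "))      // after: same result in a single combinator
```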

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31416 from LuciferYang/SPARK-34310.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-01 08:21:35 -06:00
neko bd9eeebce0 [SPARK-34288][WEBUI] Add a tip info for the resources column in the executors page
### What changes were proposed in this pull request?
This is an extension of PR #25613. It adds a tooltip for the `Resources` column to make it easier for users to know what the column actually means and to avoid confusion.

### Why are the changes needed?
After upgrading from 2.3.2 to 3.0.1, the new `Resources` column in the executors page is always blank when GPU/FPGA resources are not used,
and there is no tooltip, so users are often confused because they do not know the exact meaning of this column.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a tooltip to the executors page.

### How was this patch tested?
Manual test works well, as below:
![fixed](https://user-images.githubusercontent.com/52202080/106248350-d032ee00-624b-11eb-9ae4-92319ed11110.png)

Closes #31392 from akiyamaneko/executors-resources-tips.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-01-30 10:23:52 +08:00
yangjie01 2e192b6f45 [SPARK-34284][CORE][TESTS] Fix deprecated API usage of Apache commons-io
### What changes were proposed in this pull request?
There are some deprecated API usage compilation warnings related to Apache commons-io, as follows:

 ```
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1109: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_left | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1110: [deprecation  org.apache.spark.deploy.SparkSubmitSuite.checkDownloadedFile.$org_scalatest_assert_macro_expr.$org_scalatest_assert_macro_right | origin=org.apache.commons.io.FileUtils.readFileToString | version=] method readFileToString in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1152: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:1167: [deprecation  org.apache.spark.deploy.SparkSubmitSuite | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:201: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.<local HistoryServerSuite>.$anonfun.exp | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:716: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.getContentAndCode.inString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala:732: [deprecation  org.apache.spark.deploy.history.HistoryServerSuite.connectAndGetInputStream.errString.$anonfun | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:267: [deprecation  org.apache.spark.streaming.InputStreamsSuite.<local InputStreamsSuite>.$anonfun.$anonfun.write | origin=org.apache.commons.io.IOUtils.write | version=] method write in class IOUtils is deprecated
[WARNING] [Warn] /spark/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala:912: [deprecation  org.apache.spark.streaming.StreamingContextSuite.createCorruptedCheckpoint | origin=org.apache.commons.io.FileUtils.write | version=] method write in class FileUtils is deprecated
```

The main API change is the need to add a `java.nio.charset.Charset` parameter when the corresponding methods are called, so the main change of this PR is to add a `StandardCharsets.UTF_8` parameter to these methods.

### Why are the changes needed?
Fix deprecated API usage of Apache commons-io.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31389 from LuciferYang/SPARK-34284.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 17:50:14 +09:00
Dongjoon Hyun bc41c5a0e5 [SPARK-34273][CORE] Do not reregister BlockManager when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to prevent `HeartbeatReceiver` from asking `Executor` to re-register its block manager when the SparkContext is already stopped.

### Why are the changes needed?

Currently, `HeartbeatReceiver` blindly asks for re-registration on every new heartbeat message.
However, when the SparkContext is stopped, we don't need to re-register the block manager.
Re-registration causes unnecessary executor logs and a delay in job termination.
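
A minimal sketch of the decision only; the real change lives in `HeartbeatReceiver`, and `contextStopped`/`blockManagerKnown` are hypothetical stand-ins for `SparkContext.isStopped` and the driver-side liveness check:

```scala
// Once the context is stopped, never ask the executor to re-register.
def shouldReregister(contextStopped: Boolean, blockManagerKnown: Boolean): Boolean =
  !contextStopped && !blockManagerKnown
```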

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 13:06:42 -08:00
Dongjoon Hyun 23d4f6b393 [SPARK-34278][CORE] Make BlockManagerMaster driver heartbeat timeout configurable
### What changes were proposed in this pull request?

This PR adds a new configuration, `spark.storage.blockManagerMasterDriverHeartbeatTimeoutMs`.

### Why are the changes needed?

Currently, it's a hard-coded `10 minutes`.
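
A hedged usage example; the value format follows Spark's usual time-string configs and should be treated as illustrative:

```scala
import org.apache.spark.SparkConf

// Raise the driver-side BlockManagerMaster heartbeat timeout from the
// previously hard-coded 10 minutes to 20 minutes.
val conf = new SparkConf()
  .set("spark.storage.blockManagerMasterDriverHeartbeatTimeoutMs", "20min")
```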

### Does this PR introduce _any_ user-facing change?

No. The default value is the same.

### How was this patch tested?

Pass the CIs.

Closes #31383 from dongjoon-hyun/SPARK-34278.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 09:23:40 -08:00
yangjie01 15445a8d9e [SPARK-34275][CORE][SQL][MLLIB] Replaces filter and size with count
### What changes were proposed in this pull request?
Use `count` to simplify the `filter + size (or length)` pattern; it's semantically consistent but looks simpler.

**Before**
```
seq.filter(p).size
```

**After**
```
seq.count(p)
```

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31374 from LuciferYang/SPARK-34275.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 15:27:07 +09:00
Holden Karau 9d83d62f14 [SPARK-34193][CORE] TorrentBroadcast block manager decommissioning race fix
### What changes were proposed in this pull request?

Allow broadcast blocks to be put during decommissioning, since migrations don't apply to them and they may be stored as part of job execution.

### Why are the changes needed?

Potential race condition.

### Does this PR introduce _any_ user-facing change?

Removal of race condition.

### How was this patch tested?

New unit test.

Closes #31298 from holdenk/SPARK-34193-torrentbroadcast-blockmanager-decommissioning-potential-race-condition.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 06:15:35 +09:00
neko f1bc37e624 [SPARK-34221][WEBUI] Ensure if a stage fails in the UI page, the corresponding error message can be displayed correctly
### What changes were proposed in this pull request?
Ensure that if a stage fails in the UI page, the corresponding error message can be displayed correctly.

### Why are the changes needed?
The error message is not handled properly in JavaScript. If 'at' does not exist, the error message on the page will be blank.
I made two changes:
1. `msg.indexOf("at")` => `msg.indexOf("\n")`

![image](https://user-images.githubusercontent.com/52202080/105663531-7362cb00-5f0d-11eb-87fd-008ed65c33ca.png)

  As shown above, truncating at the 'at' position results in a strange abstract of the error message. If there is a `\n`, it is more reasonable to truncate at the `\n` position.

2. If `\n` does not exist, check whether the msg is longer than 100 characters. If so, truncate the display to avoid an overly long error message (the rule is sketched below)
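
The actual change is in the page's JavaScript; a Scala rendering of the same truncation rule, for illustration:

```scala
// Hedged sketch: prefer cutting at the first newline; otherwise cap very
// long messages at 100 characters.
def abbreviateErrorMessage(msg: String): String = {
  val cut = msg.indexOf('\n')
  if (cut >= 0) msg.substring(0, cut)
  else if (msg.length > 100) msg.substring(0, 100) + "..."
  else msg
}
```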

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test shown below; it's just a JS change:

Before modification:
![problem](https://user-images.githubusercontent.com/52202080/105712153-661cff00-5f54-11eb-80bf-e33c323c4e55.png)

After modification:
![after mdified](https://user-images.githubusercontent.com/52202080/105712180-6c12e000-5f54-11eb-8998-ff8bc8a0a503.png)

Closes #31314 from akiyamaneko/error_message_display_empty.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-27 10:01:57 -08:00
Chao Sun abf7e81712 [SPARK-33212][FOLLOW-UP][BUILD] Bring back duplicate dependency check and add more strict Hadoop version check
### What changes were proposed in this pull request?

1. Add back Maven enforcer for duplicate dependencies check
2. Stricter check on Hadoop versions which support the shaded client in `IsolatedClientLoader`. To do a proper version check, this adds a util function `majorMinorPatchVersion` to extract the major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back.

Also, the Hadoop shaded client doesn't work in certain cases (see [these comments](https://github.com/apache/spark/pull/30701#discussion_r558522227) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support for the shaded client, and otherwise falls back to the old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 15:34:55 -08:00
yangjie01 8999e8805d [SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by Source.fromXXX are closed
### What changes were proposed in this pull request?
Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` will leak the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap the `BufferedSource` and ensure these sources are closed.
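
A minimal sketch of the pattern; the file name is illustrative, and `scala.io.BufferedSource` is `Closeable`, so `Utils.tryWithResource` can close it:

```scala
import scala.io.Source
import org.apache.spark.util.Utils

// Leaks the file handle: the BufferedSource is never closed.
// val leaked = Source.fromFile("conf/spark-defaults.conf").mkString

// Closes the source even if mkString throws.
val contents = Utils.tryWithResource(Source.fromFile("conf/spark-defaults.conf")) {
  source => source.mkString
}
```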

### Why are the changes needed?
Avoid file handle leak.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31323 from LuciferYang/source-not-closed.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 19:06:37 +09:00
Warren Zhu 68b765e6b8 [SPARK-34232][CORE] Redact SparkListenerEnvironmentUpdate event in log
### What changes were proposed in this pull request?
Redact the SparkListenerEnvironmentUpdate event in logs when its processing time exceeds logSlowEventThreshold.

### Why are the changes needed?
Credentials could be exposed when the event's processing time exceeds logSlowEventThreshold.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested

Closes #31335 from warrenzhu25/34232.

Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 17:10:53 +09:00
Yuanjian Li 59cbacaddf [SPARK-34185][DOCS] Review and fix issues in API docs
### What changes were proposed in this pull request?
Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?
Fix the issues in the Spark 3.1.1 release API docs.

### Does this PR introduce _any_ user-facing change?
Yes, API doc changes.

### How was this patch tested?
Manual test.

Closes #31271 from xuanyuanking/SPARK-34185.

Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:38:20 +09:00
Ismaël Mejía e9e81f798f [SPARK-27733][CORE] Upgrade Avro to version 1.10.1
### What changes were proposed in this pull request?

Update Avro dependency to version 1.10.1

### Why are the changes needed?

To catch up with multiple improvements in Avro, as well as to fix security issues in transitive dependencies.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Since no API changes were required, we just ran the tests.

Closes #31232 from iemejia/SPARK-27733-avro-upgrade.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:42:27 -08:00
ulysses-you 2145f07e22 [SPARK-34166][CORE][TESTS] Fix flaky test in DecommissionWorkerSuite
### What changes were proposed in this pull request?

Add executor number check.

### Why are the changes needed?

The test `decommission workers ensure that shuffle output is regenerated even with shuffle service` assumes it has two executors and that both tasks can execute concurrently.

The two tasks will execute serially if there is only one executor, so the test result is unexpected. E.g.

```
[info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) (DecommissionWorkerSuite.scala:190)
```
The failed task is due to the first task finishing and the worker being decommissioned (a sketch of the added guard is below).
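
A minimal sketch of the added guard, assuming the suite can use `TestUtils.waitUntilExecutorsUp` (timeout in milliseconds):

```scala
import org.apache.spark.{SparkContext, TestUtils}

// Hedged sketch: block until both executors are registered so the two
// tasks can actually run concurrently before any decommission happens.
def ensureTwoExecutors(sc: SparkContext): Unit =
  TestUtils.waitUntilExecutorsUp(sc, numExecutors = 2, timeout = 60000) // 60s
```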

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #31255 from ulysses-you/SPARK-34166.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-19 21:26:24 -08:00
Norbert Schultz c3d8352ca1 [SPARK-34115][CORE] Check SPARK_TESTING as lazy val to avoid slowdown
### What changes were proposed in this pull request?
Check SPARK_TESTING as a lazy val to avoid a slowdown when there are many environment variables.

### Why are the changes needed?
If there are many environment variables, sys.env is very slow. As Utils.isTesting is called very often during DataFrame optimization, this can slow down evaluation significantly.

An example for triggering the problem can be found in the bug ticket https://issues.apache.org/jira/browse/SPARK-34115
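
A minimal sketch of the fix, assuming `Utils.isTesting` looked roughly like the "before" version:

```scala
object Before {
  // Every call scans the (possibly huge) environment map.
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
}

object After {
  // The env lookup is evaluated once and cached; sys.props stays dynamic.
  private lazy val SPARK_TESTING = sys.env.contains("SPARK_TESTING")
  def isTesting: Boolean = SPARK_TESTING || sys.props.contains("spark.testing")
}
```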

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With the example provided in the ticket.

Closes #31244 from nob13/bug/34115.

Lead-authored-by: Norbert Schultz <norbert.schultz@reactivecore.de>
Co-authored-by: Norbert Schultz <noschultz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-20 09:39:13 +09:00
Dongjoon Hyun 415506cc04 [SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping SparkContext
### What changes were proposed in this pull request?

This PR aims to support fallback storage clean-up when stopping `SparkContext`.

### Why are the changes needed?

SPARK-33545 added `Support Fallback Storage during worker decommission` for managed cloud storages with TTL support (usually one day). This PR adds an additional clean-up feature when stopping `SparkContext`, in order to save some money before the TTL expires, or for other HDFS-compatible storages which don't have TTL support.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature.

### How was this patch tested?

Pass the newly added UT.

Closes #31215 from dongjoon-hyun/SPARK-34142.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-17 16:54:01 -08:00
mohan3d ebd8bc934d [SPARK-34123][WEB UI] optimize spark history summary page loading
### What changes were proposed in this pull request?
Display history server entries using DataTables instead of Mustache + DataTables, which proved to be faster and non-blocking for the webpage while searching (using the search bar on the page).

### Why are the changes needed?
Small changes in the attempts (entries) and removed part of HTML (Mustache template).

### Does this PR introduce _any_ user-facing change?
Not very sure, but it's not supposed to change the way the page looks; rather, it changes how entries are displayed.

### How was this patch tested?
Running existing tests, since it's not adding new functionality.

Closes #31191 from mohan3d/feat/history-server-ui-optimization.

Lead-authored-by: mohan3d <mohan3d94@gmail.com>
Co-authored-by: Author: mohan3d <mohan3d94@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-17 14:37:28 -06:00
Chao Sun b6f46ca297 [SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This:
1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x.
2. upgrades the built-in Hadoop 3.x version to Hadoop 3.2.2

Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, recent Hadoop versions starting from Hadoop 3.2.1 have upgraded to Guava 27+. In order to resolve the Guava conflicts, this takes the approach of switching to the shaded client jars provided by Hadoop. This also has the benefit of avoiding pulling in other 3rd-party dependencies from the Hadoop side, so as to avoid more potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with the `hadoop-provided` option, they should make sure the classpath contains the `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the classpath order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #30701 from sunchao/test-hadoop-3.2.2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 14:06:50 -08:00
KevinSmile c75c29dcaa [SPARK-32598][SCHEDULER] Fix missing driver logs under UI App-Executors tab in standalone cluster mode
### What changes were proposed in this pull request?
Fix [SPARK-32598] (missing driver logs under the UI-ApplicationDetails-Executors tab in standalone cluster mode).

The direct bug is: the original author forgot to implement `getDriverLogUrls` in `StandaloneSchedulerBackend`

1de272f98d/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala (L70-L75)

So we set DriverLogUrls as env in `DriverRunner`, and retrieve it at `StandaloneSchedulerBackend`.

### Why are the changes needed?
Fix bug  [SPARK-32598].

### Does this PR introduce _any_ user-facing change?
Yes. Users will now see driver logs (standalone cluster mode) under the UI-ApplicationDetails-Executors tab.

Before:
![image](https://user-images.githubusercontent.com/17903517/93901055-b5de8600-fd28-11ea-879a-d97e6f70cc6e.png)

After:
![image](https://user-images.githubusercontent.com/17903517/93901080-baa33a00-fd28-11ea-8895-3787c5efbf88.png)

### How was this patch tested?
Re-checked the real case in [SPARK-32598] and found the user-facing bug fixed.

Closes #29644 from KevinSmile/kw-dev-master.

Authored-by: KevinSmile <kevinwang013@hotmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-15 09:01:26 -06:00
yangjie01 9e33d49b5b [SPARK-33346][CORE][SQL][MLLIB][DSTREAM][K8S] Change the never changed 'var' to 'val'
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this PR turns these from `var` to `val`, except for `mockito`-related cases.

### Why are the changes needed?
Use `val` instead of `var` when possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31142 from LuciferYang/SPARK-33346.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-15 08:47:02 -06:00
yangjie01 8ed23ed499 [SPARK-34118][CORE][SQL] Replaces filter and check for emptiness with exists or forall
### What changes were proposed in this pull request?
This PR uses `exists` or `forall` to simplify `filter + emptiness check`; it's semantically consistent but looks simpler. The rules are as follows (a small example appears after the list):

- `seq.filter(p).size == 0` -> `!seq.exists(p)`
- `seq.filter(p).length > 0` -> `seq.exists(p)`
- `seq.filterNot(p).isEmpty` -> `seq.forall(p)`
- `seq.filterNot(p).nonEmpty` -> `!seq.forall(p)`
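
For example (illustrative):

```scala
val xs = Seq(1, 2, 3)

xs.filter(_ > 2).nonEmpty   // before: builds an intermediate collection
xs.exists(_ > 2)            // after: short-circuits at the first match

xs.filterNot(_ > 0).isEmpty // before
xs.forall(_ > 0)            // after
```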

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31184 from LuciferYang/SPARK-34118.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-15 12:12:33 +09:00
Gengliang Wang 467d758973 [SPARK-34075][SQL][CORE] Hidden directories are being listed for partition inference
### What changes were proposed in this pull request?

Fix a regression from https://github.com/apache/spark/pull/29959.

In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR https://github.com/apache/spark/pull/29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To achieve the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-14 09:39:38 +09:00
yangjie01 8b1ba233f1 [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion
### What changes were proposed in this pull request?
There are some redundant collection conversions that can be removed; for version compatibility, these are cleaned up under the Scala 2.13 profile.

### Why are the changes needed?
Remove redundant collection conversion

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub  Action
- Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed

Closes #31125 from LuciferYang/SPARK-34068.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:07:02 -06:00
yangjie01 8c5fecda73 [SPARK-34070][CORE][SQL] Replaces find and emptiness check with exists
### What changes were proposed in this pull request?
This PR uses `exists` to simplify `find + emptiness check`; it's semantically consistent but looks simpler.

**Before**

```
seq.find(p).isDefined

or

seq.find(p).isEmpty
```

**After**

```
seq.exists(p)

or

!seq.exists(p)
```
### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31130 from LuciferYang/SPARK-34070.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 10:42:24 -06:00
schintap bd5039fc35 [SPARK-33741][CORE] Add min threshold time speculation config
### What changes were proposed in this pull request?
Add min threshold time speculation config

### Why are the changes needed?
When we turn on speculation with default configs, the last 10% of tasks are subject to speculation. There are a lot of stages that run for only a few seconds to minutes, and in general we don't want to speculate tasks that finish within a minimum threshold. Setting a minimum threshold in the speculation config gives us better control over speculative tasks (a usage sketch follows).
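
A hedged usage example; `spark.speculation.minTaskRuntime` is my reading of the config name this PR adds, so treat both the name and the value as assumptions:

```scala
import org.apache.spark.SparkConf

// Enable speculation but never speculate tasks that have been running
// for less than the minimum threshold.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.minTaskRuntime", "30s") // assumed config name from this PR
```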

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #30710 from redsanket/SPARK-33741.

Lead-authored-by: schintap <schintap@verizonmedia.com>
Co-authored-by: Sanket Chintapalli <chintapalli.sanketreddy@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-01-13 08:57:56 -06:00
ulysses-you f64297d290 [SPARK-32850][TEST][FOLLOWUP] Fix flaky test due to timeout
### What changes were proposed in this pull request?

Increase test timeout.

### Why are the changes needed?

It's more reasonable to use 60s instead of 6s, since many places in the code use 60s in `TestUtils.waitUntilExecutorsUp`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #31166 from ulysses-you/SPARK-32850-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 19:21:44 +09:00
ulysses-you 65222b7051 [SPARK-34069][CORE] Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
### What changes were proposed in this pull request?

Add a shouldInterruptTaskThread check when killing barrier tasks.

### Why are the changes needed?

We should interrupt the task thread if the user sets the local property `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.
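
A hedged usage sketch: the local property is normally set through `SparkContext.setJobGroup`'s `interruptOnCancel` flag, which barrier tasks now honor as well:

```scala
import org.apache.spark.SparkContext

def runCancellable(sc: SparkContext): Unit = {
  // Opt in to interrupting task threads when the job group is cancelled;
  // this sets the SPARK_JOB_INTERRUPT_ON_CANCEL local property.
  sc.setJobGroup("my-group", "cancellable barrier job", interruptOnCancel = true)
  // ... submit the barrier job here; later, from another thread:
  sc.cancelJobGroup("my-group")
}
```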

### Does this PR introduce _any_ user-facing change?

Yes, tasks will be interrupted if the user sets `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.

### How was this patch tested?

Add test.

Closes #31127 from ulysses-you/SPARK-34069.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-01-12 13:19:45 -06:00
“attilapiros” 6bd7a6200f [SPARK-33711][K8S] Avoid race condition between POD lifecycle manager and scheduler backend
### What changes were proposed in this pull request?

Missing POD detection is extended with a timestamp (and time limit) based check to avoid wrongful detection of missing PODs.

The two new timestamps:
- `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot` which only updated by the pod polling snapshot source
- `registrationTs` is introduced for the `ExecutorData` and it is initialized at the executor registration at the scheduler backend

Moreover a new config `spark.kubernetes.executor.missingPodDetectDelta` is used to specify the accepted delta between the two.

### Why are the changes needed?

Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single POD changes. This could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.

A key indicator of this error is seeing this log message:

> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."

So one of the problems is running the missing POD detection check when a single POD changes, without having a full, consistent snapshot of all the PODs (see `ExecutorPodsPollingSnapshotSource`).
The other problem is the race between the executor POD lifecycle manager and the scheduler backend: even with a full snapshot, the registration at the scheduler backend could precede the snapshot polling (and the processing of those polled snapshots).

### Does this PR introduce _any_ user-facing change?

Yes. When the POD is missing, the reason message explaining the executor's exit is extended with both timestamps (the polling time and the executor registration time), and the new config is mentioned as well.

### How was this patch tested?

The existing unit tests are extended.

Closes #30675 from attilapiros/SPARK-33711.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-01-11 14:25:12 -08:00
yikf 1495ad8c46 [SPARK-33991][CORE][WEBUI] Repair enumeration conversion error for AllJobsPage
### What changes were proposed in this pull request?
The `AllJobsPage` class gets the schedulingMode enumeration value by loading the `spark.scheduler.mode` configuration from SparkConf, but an enumeration conversion error occurs when the value of this configuration is set to lowercase.

The reason is that the values of the `SchedulingMode` enumeration are uppercase, so the error occurs when `spark.scheduler.mode` is configured in lowercase.

The `org.apache.spark.scheduler.TaskSchedulerImpl` class converts the `spark.scheduler.mode` value to uppercase, so it should be converted in `AllJobsPage` as well (see the sketch below).
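
A minimal sketch of the fix, simplified from what `AllJobsPage` would do:

```scala
import java.util.Locale
import org.apache.spark.scheduler.SchedulingMode

// Hedged sketch: normalize case before the enumeration lookup, matching
// what TaskSchedulerImpl already does for spark.scheduler.mode.
def schedulingMode(raw: String): SchedulingMode.SchedulingMode =
  SchedulingMode.withName(raw.toUpperCase(Locale.ROOT))

schedulingMode("fair") // => FAIR instead of a NoSuchElementException
```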

### Why are the changes needed?
An enumeration conversion error occurred in Spark when the value of this configuration was set to lowercase.

### How was this patch tested?
Existing tests.

Closes #31015 from yikf/master.

Authored-by: yikf <13468507104@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-11 08:48:02 -06:00
angerszhu 5ef6907792 [SPARK-33084][CORE][SQL] Rename Unit test file and use fake ivy link
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/29966#discussion_r554514344, the previous change used a wrong name for the suite file; this PR fixes that problem.
It also changes the test to use a fake ivy link.

### Why are the changes needed?
Follow the file naming rule.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31118 from AngersZhuuuu/SPARK-33084-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-11 23:18:18 +09:00
yi.wu 4afca0f706 [SPARK-31952][SQL] Fix incorrect memory spill metric when doing Aggregate
### What changes were proposed in this pull request?

This PR takes over https://github.com/apache/spark/pull/28780.

1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`

2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`

### Why are the changes needed?

As mentioned in https://github.com/apache/spark/pull/28780:

> It happens when hash aggregate downgrades to sort-based aggregate.
`UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed to by the `InMemorySorter` is acquired by the outside `BytesToBytesMap`, not the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, while the disk bytes spilled metric is right.

Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`.

Issues can be reproduced by the following step by checking the SQL metrics in UI:

```
bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
```

Before:

<img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png">

After:

<img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png">

### Does this PR introduce _any_ user-facing change?

Yes, users can see the correct spill metrics after this PR.

### How was this patch tested?

Tested manually and added UTs.

Closes #31035 from Ngone51/SPARK-31952.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:15:28 +00:00
HyukjinKwon 830249284d [SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly
### What changes were proposed in this pull request?

This PR is basically a followup of https://github.com/apache/spark/pull/14332.
Calling `map` alone might leave it not executed due to lazy evaluation, e.g.:

```
scala> val foo = Seq(1,2,3)
foo: Seq[Int] = List(1, 2, 3)

scala> foo.map(println)
1
2
3
res0: Seq[Unit] = List((), (), ())

scala> foo.view.map(println)
res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)

scala> foo.view.foreach(println)
1
2
3
```

We should better use `foreach` to make sure it's executed where the output is unused or `Unit`.

### Why are the changes needed?

To prevent the potential issues by not executing `map`.

### Does this PR introduce _any_ user-facing change?

No, the current code does not look to be causing any problems for now.

### How was this patch tested?

I found these items by running IntelliJ inspection, double-checked them one by one, and fixed them. These should ideally be all instances across the codebase.

Closes #31110 from HyukjinKwon/SPARK-34059.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-10 15:22:24 -08:00
Chandni Singh d00f0695b7 [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
### What changes were proposed in this pull request?
This is the shuffle writer side change where executors can push data to remote shuffle services. This is needed for push-based shuffle - SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
Summary of changes:
- This adds support for executors to push shuffle blocks after map tasks complete writing shuffle data.
- This also introduces a timeout specifically for creating connection to remote shuffle services.

### Why are the changes needed?
- These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
- The main reason to create a separate connection creation timeout is that the existing `connectionTimeoutMs` is overloaded: it is used for connection creation timeouts as well as the connection idle timeout. The connection creation timeout should be much lower than the idle timeout. The default for `connectionTimeoutMs` is 120s, which is quite high for just establishing connections. If a shuffle server node is bad, the connection creation will fail within a few seconds. However, an overloaded shuffle server may take much longer to respond to a request, and the channel can stay idle for a much longer time, which is expected. Another reason is that with push-based shuffle, an executor may be fetching shuffle data and pushing shuffle data (for the next stage) simultaneously. Both of these tasks share the same connections with the shuffle service. If there is a bad shuffle server node and the connection creation timeout is very high, then both tasks end up waiting a long time, eventually impacting performance.

### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces client-side configs for push-based shuffle. If push-based shuffle is turned-off then the users will not see any change.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Ye Zhou yezhou@linkedin.com

Closes #30312 from otterc/SPARK-32917.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-01-08 12:21:56 -06:00
Kousuke Saruta cc20154562 [SPARK-34005][CORE] Update peak memory metrics for each Executor on task end
### What changes were proposed in this pull request?

This PR makes `AppStatusListener` update the peak memory metrics for each Executor on task end, like the other peak memory metrics (e.g., stage, executors in a stage).

### Why are the changes needed?

When `AppStatusListener#onExecutorMetricsUpdate` is called, peak memory metrics for Executors, stages, and executors in a stage are updated, but currently only the per-Executor metrics are not updated on task end.

### Does this PR introduce _any_ user-facing change?

Yes. Executor peak memory metrics are updated more accurately.

### How was this patch tested?

After I ran a job with `local-cluster[1,1,1024]` and visited `/api/v1/<appid>/executors`, I confirmed the `peakExecutorMemory` metrics are shown for an Executor even though the lifetime of each job is very short.
I also modified the json files for `HistoryServerSuite`.

Closes #31029 from sarutak/update-executor-metrics-on-taskend.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 21:24:15 -08:00
Baohe Zhang 29510821a0 [SPARK-33029][CORE][WEBUI] Fix the UI executor page incorrectly marking the driver as excluded
### What changes were proposed in this pull request?
Filter out the driver entity when updating the exclusion status of live executors (including the driver), so the driver won't be marked as excluded in the UI even if the node that hosts the driver has been marked as excluded.

### Why are the changes needed?
Before this change, if we run Spark in standalone mode with spark.blacklist.enabled=true, the driver will be marked as excluded when the host that runs the driver has been marked as excluded. This is incorrect, because the exclude-list feature excludes executors only and the driver is still active.
![image](https://user-images.githubusercontent.com/26694233/103238740-35c05180-4911-11eb-99a2-c87c059ba0cf.png)
After the fix, the driver won't be marked as excluded.
![image](https://user-images.githubusercontent.com/26694233/103238806-6f915800-4911-11eb-80d5-3c99266cfd0a.png)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test. Reopen the UI and see the driver is no longer marked as excluded.

Closes #30954 from baohe-zhang/SPARK-33029.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 19:16:40 -08:00
LantaoJin a7d3fcd354 [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.NoSuchElementException
### What changes were proposed in this pull request?
From the log below, Stage 600 could be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 threw a `NoSuchElementException` when it entered `onTaskEnd()`.
```
21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
previous stage needs to be re-run, or because a different copy of the task has already succeeded).
21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
        at scala.collection.MapLike.default(MapLike.scala:235)
        at scala.collection.MapLike.default$(MapLike.scala:234)
        at scala.collection.AbstractMap.default(Map.scala:63)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
        at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
```

### Why are the changes needed?
To avoid throwing the java.util.NoSuchElementException (a sketch of the defensive pattern is below).
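
A minimal sketch of the defensive pattern; the map and key follow the listener code above, but the exact fix in `ExecutorAllocationListener` may differ:

```scala
import scala.collection.mutable

// Hedged sketch: tolerate a stage that onStageCompleted already removed,
// instead of calling apply() on the map and throwing NoSuchElementException.
val stageAttemptToNumSpeculativeTasks = mutable.HashMap[String, Int]()
val stageAttempt = "Stage 600 (Attempt 0)"

stageAttemptToNumSpeculativeTasks.get(stageAttempt) match {
  case Some(n) => stageAttemptToNumSpeculativeTasks(stageAttempt) = n - 1
  case None    => // stage already completed and cleaned up; nothing to do
}
```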

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a protective patch; it's not easy to reproduce in a UT because the event order is not fixed in an async queue.

Closes #31025 from LantaoJin/SPARK-34000.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:37:26 -08:00
angerszhu 559f411da8 [SPARK-33908][CORE][FOLLOWUP] Correct Scaladoc of resolveDependencyPaths/resolveMavenDependencies
### What changes were proposed in this pull request?
Fix the incorrect doc from the last change: https://github.com/apache/spark/pull/30922#discussion_r551453193

### Why are the changes needed?
Fix the doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Builds finished correctly.

Closes #31016 from AngersZhuuuu/SPARK-33908-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:44:42 -08:00
HyukjinKwon 6b86aa0b52 [SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1
### What changes were proposed in this pull request?

This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes #31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:23:38 -08:00
Baohe Zhang 45df6db906 [SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics
### What changes were proposed in this pull request?
Check if executorSummary.peakMemoryMetrics is defined before accessing it. Without the check, the UI risks getting stuck at the Executors page.

### Why are the changes needed?
The app live UI may get stuck at the Executors page without this fix.
Steps to reproduce (with master branch):
In macOS standalone mode, open a spark-shell:
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

val x = sc.makeRDD(1 to 100000, 5)
x.count()

Then open the app UI in the browser and click the Executors page; it will get stuck at this page:
![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png)

Also, the JSON returned from the API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors misses "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906.

I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test; reran the bug reproduction steps and confirmed the bug is gone.

Closes #30920 from baohe-zhang/SPARK-33906.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:34:55 -08:00