Commit graph

26710 commits

Author SHA1 Message Date
Wenchen Fan ba86524b25 [SPARK-31037][SQL] refine AQE config names
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names (see the usage sketch after this list):
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`.
4. `minNumPostShufflePartitions` is obscure, rename it to `minPartitionNum` under the `coalesceShufflePartitions` namespace.
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it to `initialPartitionNum`.
6. `skewedJoinOptimization` is too verbose. Skew join is a well-known term in the database area, so we can just say `skewJoin`.
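
A minimal usage sketch of the refined naming scheme. The keys below are inferred from the list above rather than copied from the PR, so the exact final names may differ slightly; unknown keys are simply passed through by the session builder, so the snippet runs as-is.
```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: key names are inferred from the list above, not taken from the PR diff.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("aqe-config-names")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalesceShufflePartitions.enabled", "true") // was ...reducePostShufflePartitions...
  .config("spark.sql.adaptive.skewJoin.enabled", "true")                  // was ...skewedJoinOptimization...
  .getOrCreate()
```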

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 00:46:34 +08:00
yi.wu 2257ce2443 [SPARK-31034][CORE] ShuffleBlockFetcherIterator should always create request for last block group
### What changes were proposed in this pull request?

This is a bug fix of #27280. This PR fixes the bug where `ShuffleBlockFetcherIterator` may forget to create a request for the last block group.

### Why are the changes needed?

When (all blocks).sum < `targetRemoteRequestSize` and (all blocks).length > `maxBlocksInFlightPerAddress` and (last block group).size < `maxBlocksInFlightPerAddress`,
`ShuffleBlockFetcherIterator` will not create a request for the last group. Thus, the reduce task will lose data.
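
A standalone sketch of the request-packing logic described above (not the actual `ShuffleBlockFetcherIterator` code, and the function name is made up): blocks are grouped by a size target and a per-address block-count limit, and the fix is to also flush the final partial group.
```scala
def packIntoRequests(
    blockSizes: Seq[Long],
    targetRemoteRequestSize: Long,
    maxBlocksInFlightPerAddress: Int): Seq[Seq[Long]] = {
  val requests = Seq.newBuilder[Seq[Long]]
  var current = Vector.empty[Long]
  var currentBytes = 0L
  blockSizes.foreach { size =>
    current :+= size
    currentBytes += size
    if (currentBytes >= targetRemoteRequestSize || current.size >= maxBlocksInFlightPerAddress) {
      requests += current
      current = Vector.empty
      currentBytes = 0L
    }
  }
  if (current.nonEmpty) requests += current // without this, the last block group is silently dropped
  requests.result()
}
```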

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated test.

Closes #27786 from Ngone51/fix_no_request_bug.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 21:31:26 +08:00
Maxim Gekk 1fd9a91c66 [SPARK-31005][SQL] Support time zone ids in casting strings to timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of the input string. After the changes, the function accepts zone ids in the following formats (see the short illustration after this list):
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starts with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).
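
For reference, a quick standalone illustration with plain `java.time` (not the Spark parser) showing that the zone id shapes listed above all resolve to valid `ZoneId` values; the concrete ids are arbitrary examples.
```scala
import java.time.ZoneId

Seq("Z", "-08:00", "+01:30", "UTC+3", "GMT-08:00", "PST", "Europe/Paris", "America/New_York")
  .foreach(id => println(ZoneId.of(id, ZoneId.SHORT_IDS))) // SHORT_IDS resolves short ids like "PST"
```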

### Why are the changes needed?
- To use `stringToTimestamp` as a substitution of removed `stringToTime`, see https://github.com/apache/spark/pull/27710#discussion_r385020173
- Improve UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets that can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?
Yes. After the changes, casting strings to timestamps allows time zone id at the end of the strings:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?
- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Run `CastSuite` and `AnsiCastSuite`.

Closes #27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 20:49:43 +08:00
Wenchen Fan 807ea413b4 [SPARK-31019][SQL] make it clear that people can deduplicate map keys
### What changes were proposed in this pull request?

rename the config and make it non-internal.

### Why are the changes needed?

Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate them. However, we must provide a way for users to get out of this situation, instead of just refusing to run the query. This exit strategy should always be there, while a legacy config indicates that it may be removed someday.
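
A hedged sketch of the exit strategy in spark-shell (where `spark` is predefined), assuming the renamed, non-internal config lands as `spark.sql.mapKeyDedupPolicy` with a `LAST_WIN` policy; the exact name and values are whatever the PR defines.
```scala
// Hedged sketch: adjust the key/value to the names actually introduced by this PR.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b')[1]").show() // keeps the last value instead of failing the query
```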

### Does this PR introduce any user-facing change?

No, it just renames a config which was added in 3.0.

### How was this patch tested?

add more tests for the fail behavior.

Closes #27772 from cloud-fan/map.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-05 20:43:52 +09:00
Kent Yao f45ae7f2c5 [SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone
### What changes were proposed in this pull request?

The `spark.sql.session.timeZone` config can accept any string value, including invalid time zone ids, which will then fail other queries that rely on the time zone. We should do the value checking in the set phase and fail fast if the zone value is invalid.
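
A standalone sketch of the kind of check described above (not the actual `SQLConf` checkValue code): validate the candidate value eagerly and reject anything that is not a resolvable zone id.
```scala
import java.time.{DateTimeException, ZoneId}

def isValidTimeZone(id: String): Boolean =
  try { ZoneId.of(id, ZoneId.SHORT_IDS); true }
  catch { case _: DateTimeException => false }

assert(isValidTimeZone("America/Los_Angeles"))
assert(!isValidTimeZone("Mars/Olympus_Mons")) // hypothetical bad value: fail fast at set time
```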

### Why are the changes needed?

Improve configuration validation.
### Does this PR introduce any user-facing change?

Yes, setting the config will now fail fast if the value is an invalid time zone id.
### How was this patch tested?

Added a unit test.

Closes #27792 from yaooqinn/SPARK-31038.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 19:38:20 +08:00
maryannxue 9b602e26d2 [SPARK-31046][SQL] Make more efficient and clean up AQE update UI code
### What changes were proposed in this pull request?
This PR avoids sending redundant metrics (those that have been included in a previous update) as well as useless metrics (those in future stages) to the Spark UI in the AQE UI metrics update.

### Why are the changes needed?
This change will make UI metrics update more efficient.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manual test in Spark UI.

Closes #27799 from maryannxue/aqe-ui-cleanup.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:53:01 +08:00
Terry Kim 66b4fd040e [SPARK-31024][SQL] Allow specifying session catalog name spark_catalog in qualified column names for v1 tables
### What changes were proposed in this pull request?

Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i`.

This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl
```

This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.

### Why are the changes needed?

Fixing an inconsistent behavior.

### Does this PR introduce any user-facing change?

Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```

### How was this patch tested?

Added new tests.

Closes #27776 from imback82/spark_catalog_col_name_resolution.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:33:59 +08:00
yi.wu 0a22f19664 [SPARK-31050][TEST] Disable flaky Roundtrip test in KafkaDelegationTokenSuite
### What changes were proposed in this pull request?

Disable test `KafkaDelegationTokenSuite`.

### Why are the changes needed?

`KafkaDelegationTokenSuite` is too flaky.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27789 from Ngone51/retry_kafka.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-05 00:21:32 -08:00
Yuanjian Li 7db0af5785 [SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter
### What changes were proposed in this pull request?
This is a follow-up work for #27441. For cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of silently changing the result. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid silent result change for new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect that the legacy formatter can parse the string while the new formatter returns null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 15:29:39 +08:00
Kent Yao 3edab6cc1d [MINOR][CORE] Expose the alias -c flag of --conf for spark-submit
### What changes were proposed in this pull request?

-c is short for --conf; it has been available since v1.1.0 but hidden from users until now.

### Why are the changes needed?

Expose the hidden feature.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27802 from yaooqinn/conf.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-04 20:37:51 -08:00
beliefer ebcff675e0 [SPARK-30889][SPARK-30913][CORE][DOC] Add version information to the configuration of Tests.scala and Worker
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Tests` and `Worker`.
2.Update the docs of `Worker`.

I sorted out some information of `Tests`, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.memory | 1.6.0 | SPARK-10983 | b3ffac5178795f2d8e7908b3e77e8e89f50b5f6f#diff-395d07dcd46359cca610ce74357f0bb4 |  
spark.testing.dynamicAllocation.scheduleInterval | 2.3.0 | SPARK-22864 | 4e9e6aee44bb2ddb41b567d659358b22fd824222#diff-b096353602813e47074ace09a3890d56 |  
spark.testing | 1.0.1 | SPARK-1606 | ce57624b8232159fe3ec6db228afc622133df591#diff-d239aee594001f8391676e1047a0381e |  
spark.test.noStageRetry | 1.2.0 | SPARK-3796 | f55218aeb1e9d638df6229b36a59a15ce5363482#diff-6a9ff7fb74fd490a50462d45db2d5e11 |  
spark.testing.reservedMemory | 1.6.0 | SPARK-12081 | 84c44b500b5c90dffbe1a6b0aa86f01699b09b96#diff-395d07dcd46359cca610ce74357f0bb4 |
spark.testing.nHosts | 3.0.0 | SPARK-26491 | 1a641525e60039cc6b10816e946cb6f44b3e2696#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  
spark.testing.nExecutorsPerHost | 3.0.0 | SPARK-26491 | 1a641525e60039cc6b10816e946cb6f44b3e2696#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  
spark.testing.nCoresPerExecutor | 3.0.0 | SPARK-26491 | 1a641525e60039cc6b10816e946cb6f44b3e2696#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  
spark.resources.warnings.testing | 3.1.0 | SPARK-29148 | 496f6ac86001d284cbfb7488a63dd3a168919c0f#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  
spark.testing.resourceProfileManager | 3.1.0 | SPARK-29148 | 496f6ac86001d284cbfb7488a63dd3a168919c0f#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  

I sorted out some information of `Worker`, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.worker.resourcesFile | 3.0.0 | SPARK-27369 | 7cbe01e8efc3f6cd3a0cac4bcfadea8fcc74a955#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |  
spark.worker.timeout | 0.6.2 | None | e395aa295aeec6767df798bf1002b1f30983c1cd#diff-776a630ac2b2ec5fe85c07ca20a58fc0 |  
spark.worker.driverTerminateTimeout | 2.1.2 | SPARK-20843 | ebd72f453aa0b4f68760d28b3e93e6dd33856659#diff-829a8674171f92acd61007bedb1bfa4f |  
spark.worker.cleanup.enabled | 1.0.0 | SPARK-1154 | 1440154c27ca48b5a75103eccc9057286d3f6ca8#diff-916ca56b663f178f302c265b7ef38499 |  
spark.worker.cleanup.interval | 1.0.0 | SPARK-1154 | 1440154c27ca48b5a75103eccc9057286d3f6ca8#diff-916ca56b663f178f302c265b7ef38499 |  
spark.worker.cleanup.appDataTtl | 1.0.0 | SPARK-1154 | 1440154c27ca48b5a75103eccc9057286d3f6ca8#diff-916ca56b663f178f302c265b7ef38499 |  
spark.worker.preferConfiguredMasterAddress | 2.2.1 | SPARK-20529 | 75e5ea294c15ecfb7366ae15dce196aa92c87ca4#diff-916ca56b663f178f302c265b7ef38499 |  
spark.worker.ui.port | 1.1.0 | SPARK-2857 | 12f99cf5f88faf94d9dbfe85cb72d0010a3a25ac#diff-48ca297b6536cb92362bec1487581f05 |  
spark.worker.ui.retainedExecutors | 1.5.0 | SPARK-9202 | c0686668ae6a92b6bb4801a55c3b78aedbee816a#diff-916ca56b663f178f302c265b7ef38499 |
spark.worker.ui.retainedDrivers | 1.5.0 | SPARK-9202 | c0686668ae6a92b6bb4801a55c3b78aedbee816a#diff-916ca56b663f178f302c265b7ef38499 |
spark.worker.ui.compressedLogFileLengthCacheSize | 2.0.2 | SPARK-17711 | 26e978a93f029e1a1b5c7524d0b52c8141b70997#diff-d239aee594001f8391676e1047a0381e |  
spark.worker.decommission.enabled | 3.1.0 | SPARK-20628 | d273a2bb0fac452a97f5670edd69d3e452e3e57e#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |  

### Why are the changes needed?
Supplements the configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27783 from beliefer/add-version-to-tests-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-05 11:58:21 +09:00
Jungtaek Lim (HeartSaVioR) 78721fd8c5 [SPARK-31014][CORE] InMemoryStore: remove key from parentToChildrenMap when removing key from CountingRemoveIfForEach
### What changes were proposed in this pull request?

This patch addresses a spot missed in SPARK-30964 (#27716). SPARK-30964 added a secondary index which defines the relationship between a parent and its children and makes it faster to operate on all children of a given parent.

While SPARK-30964 handled the addition and deletion of the secondary index in InstanceList properly, it missed adding code to handle deletion of the secondary index in CountingRemoveIfForEach, resulting in leaked index entries. This patch adds the deletion of the secondary index in CountingRemoveIfForEach.
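
A self-contained toy sketch of the point above (illustrative names, not the actual `InMemoryStore` code): whenever an entry is removed from the primary map, its key must also be removed from the secondary parent-to-children index, otherwise the index leaks entries.
```scala
import scala.collection.mutable

val values   = mutable.HashMap.empty[Long, Int]              // natural key -> parent (e.g. stage id)
val children = mutable.HashMap.empty[Int, mutable.Set[Long]] // parent -> natural keys (secondary index)

def put(key: Long, parent: Int): Unit = {
  values(key) = parent
  children.getOrElseUpdate(parent, mutable.Set.empty) += key
}

def removeIf(predicate: Int => Boolean): Int = {
  var removed = 0
  values.toList.foreach { case (key, parent) =>
    if (predicate(parent)) {
      values.remove(key)
      children.get(parent).foreach(_ -= key) // the missed step: also clean the secondary index
      removed += 1
    }
  }
  removed
}
```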

### Why are the changes needed?

Described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A, as the relevant field and class are marked as private, and they cannot be checked at a higher level. I'm not sure we want to adjust the scope to add a test.

Closes #27765 from HeartSaVioR/SPARK-31014.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-03-04 10:05:26 -08:00
DB Tsai 3c16fae5c1 [SPARK-31027][SQL] Refactor DataSourceStrategy to be more extendable
### What changes were proposed in this pull request?
Refactor `DataSourceStrategy.scala` and `DataSourceStrategySuite.scala` so they are more extensible for implementing nested predicate pushdown.

### Why are the changes needed?
To support nested predicate pushdown.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests and new tests.

Closes #27778 from dbtsai/SPARK-31027.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 23:41:49 +09:00
yi.wu 87b93d32a6 [SPARK-31017][TEST][CORE] Test for shuffle requests packaging with different size and numBlocks limit
### What changes were proposed in this pull request?

Added 2 tests for `ShuffleBlockFetcherIteratorSuite`.

### Why are the changes needed?

When packaging shuffle fetch requests in `ShuffleBlockFetcherIterator`, there are two limitations: `maxBytesInFlight` and `maxBlocksInFlightPerAddress`. However, we don’t have test cases to test them both, e.g. the size limitation is hit before the numBlocks limitation.

We should add test cases in `ShuffleBlockFetcherIteratorSuite` to test:

1. the size limitation is hit before the numBlocks limitation
2. the numBlocks limitation is hit before the size limitation

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new tests.

Closes #27767 from Ngone51/add_test.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 20:21:48 +08:00
Terry Kim b30278107f [SPARK-30885][SQL][FOLLOW-UP] Fix issues where some V1 commands allow tables that are not fully qualified
### What changes were proposed in this pull request?

There are a few V1 commands, such as `REFRESH TABLE`, that still allow `spark_catalog.t` because they run with parsed table names without trying to load them in the catalog. This PR addresses this issue.

The PR also addresses the issue brought up in https://github.com/apache/spark/pull/27642#discussion_r382402104.

### Why are the changes needed?

To fix a bug where for some V1 commands, `spark_catalog.t` is allowed.

### Does this PR introduce any user-facing change?

Yes, a bug is fixed and `REFRESH TABLE spark_catalog.t` is not allowed.

### How was this patch tested?

Added new test.

Closes #27718 from imback82/fix_TempViewOrV1Table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 18:09:48 +08:00
Wenchen Fan e4c61e35da [SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser
### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser has a special behavior that, when the parser fails to parse a timestamp/date, fallback to another way to parse it, to support some legacy format. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switched the API used for datetime operations, the behavior can't be exactly the same as before. Here we add back support for the legacy formats that are common (examples from Spark 2.4):
1. the fields can have one or two digits
```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22]                      |
+-------------------------------------------+
```
2. the separator between data and time can be "T" as well
```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12]                        |
+---------------------------------------------+
```
3. the second fraction can be arbitrary length
```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123]                                 |
+----------------------------------------------------------+
```
4. date string can end up with any chars after "T" or space
```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22]                            |
+----------------------------------------+
```
5. remove "GMT" from the string before parsing
```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123]                         |
+--------------------------------------------------+
```
### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

Yes, this makes 3.0 support parsing most of the CSV/JSON datetime values that were supported before.

### How was this patch tested?

new tests

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 18:27:44 +09:00
Shixiong Zhu ebfff7af6a [SPARK-30984][SS] Add UI test for Structured Streaming UI
### What changes were proposed in this pull request?

- Add a UI test for Structured Streaming UI
- Fix the unsafe usages of `SimpleDateFormat` by using a ThreadLocal shared object (see the sketch after this list).
- Use `start` to replace `submission` to be consistent with the API `StreamingQuery.start()`.
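
A minimal sketch of the `ThreadLocal` pattern referred to above (illustrative, not the exact code in the PR): `SimpleDateFormat` is not thread-safe, so each thread gets its own instance.
```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// One SimpleDateFormat per thread, because the class itself is not thread-safe.
val dateTimeFormat = new ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat =
    new SimpleDateFormat("yyyy/MM/dd HH:mm:ss", Locale.ROOT)
}

def formatTimestamp(millis: Long): String = dateTimeFormat.get().format(new Date(millis))
```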

### Why are the changes needed?

Structured Streaming UI is missing UI tests.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

The new test.

Closes #27732 from zsxwing/ss-ui-test.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 13:55:34 +08:00
Yuanjian Li f7f1948a8c [SPARK-30289][FOLLOWUP][DOC] Update the migration guide for spark.sql.legacy.ctePrecedencePolicy
### What changes were proposed in this pull request?
Fix the migration guide document for `spark.sql.legacy.ctePrecedence.enabled`, which was introduced in #27579.

### Why are the changes needed?
The config value changed.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Document only.

Closes #27782 from xuanyuanking/SPARK-30829-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 13:56:02 +09:00
zero323 e1b3e9a3d2 [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClassificationModel`, `ProbabilisticClassifier`, `ProbabilisticClassificationModel`, `Regressor`, `RegressionModel`) for non-Java backends.

Note

- `Predictor` and `JavaClassifier` should be abstract as the `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.

### Why are the changes needed?

To provide extensions points for non-JVM algorithms, as well as a public (as opposed to `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.

For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.

Additionally "private" `Java*` classes in`ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).

Whether the same should be done to the equivalent classes from `ml.wrapper` is open for discussion.

If we take `JavaClassifier` as an example, type hierarchy will change from

![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)

to

![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)

Similarly the old model

![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)

will become

![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 12:20:02 +08:00
zhengruifeng 111e9038d8 [SPARK-30770][ML] avoid vector conversion in GMM.transform
### What changes were proposed in this pull request?
The current implementation needs to convert ml.Vector to breeze.Vector, which can be skipped.

### Why are the changes needed?
avoid unnecessary vector conversions

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27519 from zhengruifeng/gmm_transform_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 11:02:27 +08:00
yi.wu 380e887631 [SPARK-30999][SQL] Don't cancel a QueryStageExec which failed before call doMaterialize
### What changes were proposed in this pull request?

This PR proposes to not cancel a `QueryStageExec` which failed before calling `doMaterialize`.

Besides, this PR also includes 2 minor improvements:

* fail fast when stage failed before calling `doMaterialize`

* format Exception with Cause

### Why are the changes needed?

For a stage which failed before materializing the lazy value (e.g. `inputRDD`), calling `cancel` on it could re-trigger the same failure again, e.g. executing the child node again (see `AdaptiveQueryExecSuite`.`SPARK-30291: AQE should catch the exceptions when doing materialize` for example). In the end, the same failure is counted twice: once for the materialization error and once for the cancellation error.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated test.

Closes #27752 from Ngone51/avoid_cancel_finished_stage.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-03 13:40:51 -08:00
Takeshi Yamamuro 4a1d273a4a [SPARK-30997][SQL] Fix an analysis failure in generators with aggregate functions
### What changes were proposed in this pull request?

We have supported generators in SQL aggregate expressions by SPARK-28782.
But the generator (explode) query with aggregate functions in a DataFrame failed as follows:

```
// SPARK-28782: Generator support in aggregate expressions
scala> spark.range(3).toDF("id").createOrReplaceTempView("t")
scala> sql("select explode(array(min(id), max(id))) from t").show()
+---+
|col|
+---+
|  0|
|  2|
+---+

// A failure case handled in this pr
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
org.apache.spark.sql.AnalysisException:
The query operator `Generate` contains one or more unsupported
expression types Aggregate, Window or Generate.
Invalid expressions: [min(`id`), max(`id`)];;
Project [col#46L]
+- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L]
   +- Range (0, 3, step=1, splits=Some(4))

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129)
```

The root cause is that `ExtractGenerator` wrongly replaces a project with aggregate functions
before `GlobalAggregates` replaces it with an aggregate, as follows:

```
scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn")
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#72L), max(id#72L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Project [explode(array(min(id#72L), max(id#72L))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                         +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Result of Batch Resolution ===
!'Project [explode(array(min('id), max('id))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                   +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                             +- Range (0, 3, step=1, splits=Some(4))

// the analysis failed here...
```

To avoid this case in `ExtractGenerator`, this PR adds a condition to ignore generators having aggregate functions.
A correct sequence of rules is as follows:

```
20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [explode(array(min(id#27L), max(id#27L))) AS List()]   'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                         +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]   'Project [explode(_gen_input_0#31) AS List()]
!+- Range (0, 3, step=1, splits=Some(4))                           +- Aggregate [array(min(id#27L), max(id#27L)) AS _gen_input_0#31]
!                                                                     +- Range (0, 3, step=1, splits=Some(4))

```

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27749 from maropu/ExplodeInAggregate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-03 12:25:12 -08:00
Terry Kim c263c15408 [SPARK-31015][SQL] Star(*) expression fails when used with qualified column names for v2 tables
### What changes were proposed in this pull request?

For a v2 table created with `CREATE TABLE testcat.ns1.ns2.tbl (id bigint, name string) USING foo`, the following works as expected
```
SELECT testcat.ns1.ns2.tbl.id FROM testcat.ns1.ns2.tbl
```
, but a query with a qualified column name and star (*)
```
SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl
[info]   org.apache.spark.sql.AnalysisException: cannot resolve 'testcat.ns1.ns2.tbl.*' given input columns 'id, name';
```
fails to resolve. This PR proposes to fix this issue.

### Why are the changes needed?

To fix the bug described above.

### Does this PR introduce any user-facing change?

Yes, now `SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl` works as expected.

### How was this patch tested?

Added new test.

Closes #27766 from imback82/fix_star_expression.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 00:55:26 +08:00
Javier 3ff2135686 [SPARK-30049][SQL] SQL fails to parse when comment contains an unmatched quote character
### What changes were proposed in this pull request?

A SQL statement that contains a comment with an unmatched quote character can lead to a parse error:
- Added an insideComment flag in the splitter method to avoid checking single and double quotes within a comment (see the sketch after the example below):
```
spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```
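
A self-contained sketch of the insideComment idea (not the actual Spark splitter code; names are illustrative): while scanning for the ';' statement delimiter, quote characters inside a `--` line comment are ignored until the end of the line.
```scala
def splitSemiColons(sql: String): Seq[String] = {
  val statements = Seq.newBuilder[String]
  val current = new StringBuilder
  var insideSingleQuote = false
  var insideComment = false
  var i = 0
  while (i < sql.length) {
    val c = sql.charAt(i)
    if (insideComment) {
      if (c == '\n') insideComment = false        // a line comment ends at end-of-line
      current.append(c)
    } else if (c == '\'') {
      insideSingleQuote = !insideSingleQuote      // naive: ignores escaped quotes
      current.append(c)
    } else if (!insideSingleQuote && c == '-' && i + 1 < sql.length && sql.charAt(i + 1) == '-') {
      insideComment = true                        // quotes after "--" no longer matter
      current.append(c)
    } else if (!insideSingleQuote && c == ';') {
      statements += current.toString; current.clear()
    } else {
      current.append(c)
    }
    i += 1
  }
  if (current.nonEmpty) statements += current.toString
  statements.result()
}

splitSemiColons("SELECT 1 -- someone's comment here\n;") // one statement, unmatched quote ignored
```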

### Why are the changes needed?

This misbehaviour was not present in previous Spark versions.

### Does this PR introduce any user-facing change?

- No

### How was this patch tested?

- New tests were added.

Closes #27321 from javierivanov/SPARK-30049B.

Lead-authored-by: Javier <jfuentes@hortonworks.com>
Co-authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-03-03 09:55:15 -06:00
xuesenliang 7a4cf339d7 [SPARK-30388][CORE] Mark running map stages of finished job as finished, and cancel running tasks
### What changes were proposed in this pull request?

When a job finishes, its running (re-submitted) map stages should be marked as finished if they are not used by other jobs, and the running tasks of these stages should be cancelled.

The ListenerBus should be notified too; otherwise, these map stage items will stay on the "Active Stages" page of the web UI and never go away.

For example:

Suppose job 0 has two stages: map stage 0 and result stage 1. Map stage 0 has two partitions, and its result stage 1 has two partitions too.

**Steps to reproduce the bug:**
1. map stage 0:    start task 0(```TID 0```) and task 1 (```TID 1```), then both finished successfully.
2. result stage 1:  start task 0(```TID 2```) and task 1 (```TID 3```)
3. result stage 1:  task 0(```TID 2```) finished successfully
4. result stage 1:  speculative task 1.1(```TID 4```) launched, but then failed due to FetchFailedException.
5. driver re-submits map stage 0 and result stage 1.
6. map stage 0 (retry 1): task0(```TID 5```) launched
7. result stage 1: task 1(```TID 3```) finished successfully, so job 0 finished.
8. map stage 0 is removed from ```runningStages``` and ```stageIdToStage```, because it doesn't belong to any job.
```
  private def DAGScheduler#cleanupStateForJobAndIndependentStages(job: ActiveJob): HashSet[Stage] = {
   ...
      stageIdToStage.filterKeys(stageId => registeredStages.get.contains(stageId)).foreach {
        case (stageId, stage) =>
            ...
            def removeStage(stageId: Int): Unit = {
              for (stage <- stageIdToStage.get(stageId)) {
                if (runningStages.contains(stage)) {
                  logDebug("Removing running stage %d".format(stageId))
                  runningStages -= stage
                }
                ...
              }
              stageIdToStage -= stageId
            }

            jobSet -= job.jobId
            if (jobSet.isEmpty) { // no other job needs this stage
              removeStage(stageId)
            }
          }
  ...
  }

```
9. map stage 0 (retry 1): task 0 (TID 5) finished successfully, but its stage 0 is no longer in ```stageIdToStage```, so the stage is never marked as finished via ```markStageAsFinished```
```
  private[scheduler] def DAGScheduler#handleTaskCompletion(event: CompletionEvent): Unit = {
    val task = event.task
    val stageId = task.stageId
    ...
    if (!stageIdToStage.contains(task.stageId)) {
      postTaskEnd(event)
      // Skip all the actions if the stage has been cancelled.
      return
    }
    ...
```

#### Relevant spark driver logs as follows:

```
20/01/02 11:21:45 INFO DAGScheduler: Got job 0 (main at NativeMethodAccessorImpl.java:0) with 2 output partitions
20/01/02 11:21:45 INFO DAGScheduler: Final stage: ResultStage 1 (main at NativeMethodAccessorImpl.java:0)
20/01/02 11:21:45 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/01/02 11:21:45 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)

20/01/02 11:21:45 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/02 11:21:45 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1))
20/01/02 11:21:45 INFO YarnClusterScheduler: Adding task set 0.0 with 2 tasks
20/01/02 11:21:45 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 9.179.143.4, executor 1, partition 0, PROCESS_LOCAL, 7704 bytes)
20/01/02 11:21:45 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 9.76.13.26, executor 2, partition 1, PROCESS_LOCAL, 7705 bytes)
20/01/02 11:22:18 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 32491 ms on 9.179.143.4 (executor 1) (1/2)
20/01/02 11:22:26 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 40544 ms on 9.76.13.26 (executor 2) (2/2)
20/01/02 11:22:26 INFO DAGScheduler: ShuffleMapStage 0 (main at NativeMethodAccessorImpl.java:0) finished in 40.854 s
20/01/02 11:22:26 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool

20/01/02 11:22:26 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at main at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/02 11:22:26 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at main at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1))
20/01/02 11:22:26 INFO YarnClusterScheduler: Adding task set 1.0 with 2 tasks
20/01/02 11:22:26 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 9.179.143.4, executor 1, partition 0, NODE_LOCAL, 7929 bytes)
20/01/02 11:22:26 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 9.76.13.26, executor 2, partition 1, NODE_LOCAL, 7929 bytes)
20/01/02 11:22:26 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on 9.179.143.4 (executor 1) (1/2)

20/01/02 11:22:26 INFO TaskSetManager: Marking task 1 in stage 1.0 (on 9.76.13.26) as speculatable because it ran more than 158 ms
20/01/02 11:22:26 INFO TaskSetManager: Starting task 1.1 in stage 1.0 (TID 4, 9.179.143.52, executor 3, partition 1, ANY, 7929 bytes)
20/01/02 11:22:26 WARN TaskSetManager: Lost task 1.1 in stage 1.0 (TID 4, 9.179.143.52, executor 3): FetchFailed(BlockManagerId(1, 9.179.143.4, 7337, None), shuffleId=0, mapId=0, reduceId=1, message=org.apache.spark.shuffle.FetchFailedException: Connection reset by peer)
20/01/02 11:22:26 INFO TaskSetManager: Task 1.1 in stage 1.0 (TID 4) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
20/01/02 11:22:26 INFO DAGScheduler: Marking ResultStage 1 (main at NativeMethodAccessorImpl.java:0) as failed due to a fetch failure from ShuffleMapStage 0 (main at NativeMethodAccessorImpl.java:0)
20/01/02 11:22:26 INFO DAGScheduler: ResultStage 1 (main at NativeMethodAccessorImpl.java:0) failed in 0.261 s due to org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
20/01/02 11:22:26 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (main at NativeMethodAccessorImpl.java:0) and ResultStage 1 (main at NativeMethodAccessorImpl.java:0) due to fetch failure
20/01/02 11:22:26 INFO DAGScheduler: Resubmitting failed stages

20/01/02 11:22:26 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/02 11:22:26 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/01/02 11:22:26 INFO YarnClusterScheduler: Adding task set 0.1 with 1 tasks
20/01/02 11:22:26 INFO TaskSetManager: Starting task 0.0 in stage 0.1 (TID 5, 9.179.143.4, executor 1, partition 0, PROCESS_LOCAL, 7704 bytes)

// NOTE: Here should be "INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10000 ms on 9.76.13.26 (executor 2) (2/2)"
// and this bug is being fixed in https://issues.apache.org/jira/browse/SPARK-30404

20/01/02 11:22:36 INFO TaskSetManager: Ignoring task-finished event for 1.0 in stage 1.0 because task 1 has already completed successfully

20/01/02 11:22:36 INFO YarnClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/01/02 11:22:36 INFO DAGScheduler: ResultStage 1 (main at NativeMethodAccessorImpl.java:0) finished in 10.131 s
20/01/02 11:22:36 INFO DAGScheduler: Job 0 finished: main at NativeMethodAccessorImpl.java:0, took 51.031212 s

20/01/02 11:22:58 INFO TaskSetManager: Finished task 0.0 in stage 0.1 (TID 5) in 32029 ms on 9.179.143.4 (executor 1) (1/1)
20/01/02 11:22:58 INFO YarnClusterScheduler: Removed TaskSet 0.1, whose tasks have all completed, from pool
```

### Why are the changes needed?

The web UI is incorrect: ```stage 0 (retry 1)``` is finished, but it stays on the ```Active Stages``` page.

![active_stage](https://user-images.githubusercontent.com/4401756/71656718-71185680-2d77-11ea-8dbc-fd8085ab3dfb.png)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
A new test case is added.

And test manually on cluster. The result is as follows:
![cancel_stage](https://user-images.githubusercontent.com/4401756/71658434-04a15580-2d7f-11ea-952b-dd8dd685f37d.png)

Closes #27050 from liangxs/master.

Authored-by: xuesenliang <xuesenliang@tencent.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-03-03 09:29:43 -06:00
Sean Owen 97d9a22b04 [SPARK-30994][CORE] Update xerces to 2.12.0
### What changes were proposed in this pull request?

Manage up the version of Xerces that Hadoop (and potentially user apps) uses to 2.12.0, to match https://issues.apache.org/jira/browse/HADOOP-16530
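
For a user application that wants to pin the same version, a hedged sbt-style (Scala DSL) sketch; Spark itself manages this through its Maven build, not this snippet.
```scala
// build.sbt fragment (sketch): force the transitive Xerces dependency to 2.12.0
dependencyOverrides += "xerces" % "xercesImpl" % "2.12.0"
```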

### Why are the changes needed?

Picks up bug and security fixes: https://www.xml.com/news/2018-05-apache-xerces-j-2120/

### Does this PR introduce any user-facing change?

Should be no behavior changes.

### How was this patch tested?

Existing tests.

Closes #27746 from srowen/SPARK-30994.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-03 09:27:18 -06:00
roland-ondeviceresearch a4aaee01fa [MINOR][DOCS] ForeachBatch java example fix
### What changes were proposed in this pull request?
The ForeachBatch Java example was incorrect.

### Why are the changes needed?
Example did not compile

### Does this PR introduce any user-facing change?
Yes, to docs.

### How was this patch tested?
In IDE.

Closes #27740 from roland1982/foreachwriter_java_example_fix.

Authored-by: roland-ondeviceresearch <roland@ondeviceresearch.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-03 09:24:33 -06:00
Gengliang Wang b5166aac1f [SPARK-31013][CORE][WEBUI] InMemoryStore: improve removeAllByIndexValues over natural key index
### What changes were proposed in this pull request?

The method `removeAllByIndexValues` in KVStore is to delete all the objects which have certain values in the given index.
However, in the current implementation of `InMemoryStore`, when the given index is the natural key index, there is no special handling for it and a linear search over all the task data is performed.

We can improve it by deleting the natural keys directly from the internal hashmap.
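
A toy standalone sketch of the difference (illustrative names, not the actual `InMemoryStore` code): when the caller already has the natural keys, delete directly from the hashmap instead of scanning every value.
```scala
import scala.collection.mutable

final case class TaskData(taskId: Long, stageId: Int)
val store = mutable.HashMap.empty[Long, TaskData] // natural key (task id) -> task data

// Value-based index: O(n) linear scan over all task data.
def removeAllByStage(stageId: Int): Unit =
  store.values.filter(_.stageId == stageId).map(_.taskId).toList.foreach(store.remove)

// Natural-key index: O(k) direct removals, no scan needed.
def removeAllByTaskIds(taskIds: Iterable[Long]): Unit = taskIds.foreach(store.remove)
```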

### Why are the changes needed?

Better performance if the given index for `removeAllByIndexValues` is the natural key index in `InMemoryStore`.
### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Enhance the existing test.

Closes #27763 from gengliangwang/useNaturalIndex.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-03 19:34:19 +08:00
Takeshi Yamamuro 313e62c376 [SPARK-30998][SQL] ClassCastException when a generator having nested inner generators
### What changes were proposed in this pull request?

A query below failed in the master;

```
scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show()
20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    ...
```

This PR modifies the `hasNestedGenerator` check in `ExtractGenerator` to correctly catch nested inner generators.

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27750 from maropu/HandleNestedGenerators.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-03 19:00:33 +09:00
Kent Yao 1fac06c430 Revert "[SPARK-30808][SQL] Enable Java 8 time API in Thrift server"
This reverts commit afaeb29599.

### What changes were proposed in this pull request?

Based on the result and comment from https://github.com/apache/spark/pull/27552#discussion_r385531744

In the hive module, the server side simply provides datetime values using `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`.
There will be an inconsistency between client and server if we use Java 8 APIs.

### Why are the changes needed?

The change is still not clear enough.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

N/A

Closes #27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-03 14:21:20 +08:00
HyukjinKwon 3956e95f05 [SPARK-25202][SQL][FOLLOW-UP] Keep the old parameter name 'pattern' at split in Scala API
### What changes were proposed in this pull request?

To address the concern pointed out in https://github.com/apache/spark/pull/22227. This keeps `split` source-compatible by reverting the minimal cosmetic changes.
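
A small sketch of why the parameter name matters for source compatibility: user code may pass `pattern` as a named argument, which only compiles as long as the name is kept.
```scala
import org.apache.spark.sql.functions.{col, split}

// Compiles only while the second parameter keeps its original name `pattern`.
val parts = split(col("value"), pattern = ",")
```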

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

A unit test was changed (in order for us to detect source compatibility breaks easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-03 10:24:50 +09:00
maryannxue 473a28c1d0 [SPARK-30991] Refactor AQE readers and RDDs
### What changes were proposed in this pull request?
This PR combines `CustomShuffledRowRDD` and `LocalShuffledRowRDD` into `ShuffledRowRDD`, and creates `CustomShuffleReaderExec` to unify and replace all existing AQE readers: `CoalescedShuffleReaderExec`, `LocalShuffleReaderExec` and `SkewJoinShuffleReaderExec`.

### Why are the changes needed?
To reduce code redundancy.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Passed existing UTs.

Closes #27742 from maryannxue/aqe-readers.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-02 16:04:00 -08:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()` then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.
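
A tiny ScalaTest sketch of the difference (assuming a ScalaTest 3.1-style `AnyFunSuite`; the suite name is made up):
```scala
import org.scalatest.funsuite.AnyFunSuite

class AssumeVsAssertExampleSuite extends AnyFunSuite {
  test("only the assert is a real check") {
    assume(!sys.props("os.name").startsWith("Windows")) // prerequisite: skip (not fail) on Windows
    assert(1 + 1 == 2)                                   // the actual verification: fail if false
  }
}
```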

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
yi.wu b517f991fe [SPARK-30969][CORE] Remove resource coordination support from Standalone
### What changes were proposed in this pull request?

Remove automatically resource coordination support from Standalone.

### Why are the changes needed?

Resource coordination is mainly designed for the scenario where multiple workers are launched on the same host. However, that is actually a non-existent scenario in today's Spark, because Spark can now start multiple executors in a single Worker, while it only allowed one executor per Worker at the very beginning. So launching multiple workers on the same host no longer helps users at all. Thus, it's not worth bringing over a complicated implementation and a potentially high maintenance cost for such an unlikely scenario.

### Does this PR introduce any user-facing change?

No, it's a Spark 3.0 feature.

### How was this patch tested?

Pass Jenkins.

Closes #27722 from Ngone51/abandon_coordination.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-03-02 11:23:07 -08:00
Wu, Xiaochang ac122762f5 [SPARK-30813][ML] Fix Matrices.sprand comments
### What changes were proposed in this pull request?
Fix mistakes in comments

### Why are the changes needed?
There are mistakes in comments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27564 from xwu99/fix-mllib-sprand-comment.

Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-02 08:56:17 -06:00
Jungtaek Lim (HeartSaVioR) f24a46011c [SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable
### What changes were proposed in this pull request?

This patch fixes a bug in UnsafeRow, which misses handling UDTs specifically in `isFixedLength` and `isMutable`. These methods don't check the underlying SQL type for a UDT, always treating UDTs as variable-length and non-mutable.

This doesn't cause any issue if a UDT is used to represent a complicated type, but when a UDT is used to represent a type that matches a fixed-length SQL type, it opens the chance of correctness issues, as this information sometimes decides how the value should be handled.

We got a report from the user mailing list which suspected that mapGroupsWithState handles UDTs incorrectly, but after some investigation the issue turned out to come from GenerateUnsafeRowJoiner in the shuffle phase.

0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)

Here, updating the position should not happen on a fixed-length column, but due to this bug the value of a UDT whose SQL type is fixed-length would be modified, which actually corrupts the value.
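
A minimal standalone sketch of the rule the patch introduces (not the actual `UnsafeRow` code; the function name is made up): classify a UDT by its underlying `sqlType` instead of treating it as variable-length.
```scala
import org.apache.spark.sql.types._

def isFixedLengthSketch(dt: DataType): Boolean = dt match {
  case udt: UserDefinedType[_] => isFixedLengthSketch(udt.sqlType) // the missing unwrap step
  case _: DecimalType          => false                            // keep the sketch conservative for decimals
  case _: NumericType          => true
  case BooleanType | DateType | TimestampType => true
  case _                       => false                            // strings, binary, arrays, maps, structs, ...
}

println(isFixedLengthSketch(LongType))   // true
println(isFixedLengthSketch(StringType)) // false
```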

### Why are the changes needed?

Misclassifying the length type of a UDT can corrupt the value when the row is presented as input to GenerateUnsafeRowJoiner, which causes a correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #27747 from HeartSaVioR/SPARK-30993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 22:33:11 +08:00
Gengliang Wang 6b641430c3 [SPARK-30964][CORE][WEBUI] Accelerate InMemoryStore with a new index
### What changes were proposed in this pull request?

Spark uses the class `InMemoryStore` as the KV storage for live UI and history server(by default if no LevelDB file path is provided).
In `InMemoryStore`, all the task data in one application is stored in a hashmap, whose key is the task ID and whose value is the task data. This is fine for getting or deleting with a provided task ID.
However, Spark stage UI always shows all the task data in one stage and the current implementation is to look up all the values in the hashmap. The time complexity is O(numOfTasks).
Also, when there are too many stages (>spark.ui.retainedStages), Spark will linearly try to look up all the task data of the stages to be deleted as well.

This can be very bad for a large application with many stages and tasks. We can improve it by allowing the natural key of an entity to have a real parent index, so that on each lookup with a parent node provided, Spark can look up all the natural keys (in our case, the task IDs) first, and then find the data for those keys in the hashmap.
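
A toy sketch of the parent-index idea described above, with illustrative names rather than the actual KVStore code: keep a secondary map from the parent key (stage) to the natural keys (task ids), so a per-stage lookup only touches that stage's tasks.
```scala
import scala.collection.mutable

final case class TaskData(taskId: Long, stageId: Int)

val byTaskId = mutable.HashMap.empty[Long, TaskData]         // primary: natural key -> data
val byStage  = mutable.HashMap.empty[Int, mutable.Set[Long]] // secondary: parent -> natural keys

def write(task: TaskData): Unit = {
  byTaskId(task.taskId) = task
  byStage.getOrElseUpdate(task.stageId, mutable.Set.empty) += task.taskId
}

// O(tasks in this stage) instead of O(all tasks in the application).
def tasksOfStage(stageId: Int): Seq[TaskData] =
  byStage.getOrElse(stageId, mutable.Set.empty[Long]).toSeq.flatMap(byTaskId.get)
```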

### Why are the changes needed?

The in-memory KV store becomes really slow for large applications. We can improve it with a new index. The performance can be 10 times, 100 times, even 1000 times faster.
This is also possible to make the Spark driver more stable for large applications.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests.
Also, I run a benchmark with the following code
```
  // Runs inside the existing test code: newTaskData and attemptId are helpers/values from
  // the surrounding suite (not shown here), and asJavaCollection needs
  // scala.collection.JavaConverters._ in scope.
  val store = new InMemoryStore()
  val numberOfTasksPerStage = 10000
  // Populate 1000 stages with 10000 tasks each.
  (0 until 1000).foreach { sId =>
    (0 until numberOfTasksPerStage).foreach { taskId =>
      val task = newTaskData(sId * numberOfTasksPerStage + taskId, "SUCCESS", sId)
      store.write(task)
    }
  }
  val appStatusStore = new AppStatusStore(store)
  var start = System.nanoTime()
  appStatusStore.taskSummary(2, attemptId, Array(0, 0.25, 0.5, 0.75, 1))
  println("task summary run time: " + ((System.nanoTime() - start) / 1000000))
  val stageIds = Seq(1, 11, 66, 88)
  val stageKeys = stageIds.map(Array(_, attemptId))
  start = System.nanoTime()
  store.removeAllByIndexValues(classOf[TaskDataWrapper], TaskIndexNames.STAGE,
    stageKeys.asJavaCollection)
  println("clean up tasks run time: " + ((System.nanoTime() - start) / 1000000))
```

Task summary before the changes: 98642ms
Task summary after the changes: 120ms

Task clean up before the changes:  4900ms
Task clean up after the changes: 4ms

It's 800x faster after the changes in the micro-benchmark.

Closes #27716 from gengliangwang/liveUIStore.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 15:48:48 +08:00
beliefer 71365c2502 [SPARK-30912][CORE][DOC] Add version information to the configuration of Streaming.scala
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Streaming`.
2.Update the docs of `Streaming`.

I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.streaming.dynamicAllocation.enabled | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.testing | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.minExecutors | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.maxExecutors | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingInterval | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingUpRatio | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingDownRatio | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |

### Why are the changes needed?
Supplements the configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27745 from beliefer/add-version-to-streaming-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:16:40 +09:00
beliefer c63366a693 [SPARK-30891][CORE][DOC] Add version information to the configuration of History
### What changes were proposed in this pull request?
1. Add version information to the configuration of `History`.
2. Update the docs of `History`.

I sorted out the information shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.history.fs.logDirectory | 1.1.0 | SPARK-1768 | 21ddd7d1e9f8e2a726427f32422c31706a20ba3f#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.safemodeCheck.interval | 1.6.0 | SPARK-11020 | cf04fdfe71abc395163a625cc1f99ec5e54cc07e#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.update.interval | 1.4.0 | SPARK-6046 | 4527761bcd6501c362baf2780905a0018b9a74ba#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.cleaner.enabled | 1.3.0 | SPARK-3562 | 8942b522d8a3269a2a357e3a274ed4b3e66ebdde#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e | branch-1.3 does not exist; the commit is in branch-1.4, but pom.xml there was still 1.3.0-SNAPSHOT at that point
spark.history.fs.cleaner.interval | 1.4.0 | SPARK-5933 | 1991337336596f94698e79c2366f065c374128ab#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |
spark.history.fs.cleaner.maxAge | 1.4.0 | SPARK-5933 | 1991337336596f94698e79c2366f065c374128ab#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |
spark.history.fs.cleaner.maxNum | 3.0.0 | SPARK-28294 | bbc2be4f425c4c26450e1bf21db407e81046ce21#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.store.path | 2.3.0 | SPARK-20642 | 74daf622de4e534d5a5929b424a6e836850eefad#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.store.maxDiskUsage | 2.3.0 | SPARK-20654 | 8b497046c647a21bbed1bdfbdcb176745a1d5cd5#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.ui.port | 1.0.0 | SPARK-1276 | 9ae80bf9bd3e4da7443af97b41fe26aa5d35d70b#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.fs.inProgressOptimization.enabled | 2.4.0 | SPARK-6951 | 653fe02415a537299e15f92b56045569864b6183#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.fs.endEventReparseChunkSize | 2.4.0 | SPARK-6951 | 653fe02415a537299e15f92b56045569864b6183#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.fs.eventLog.rolling.maxFilesToRetain | 3.0.0 | SPARK-30481 | a2fe73b83c0e7c61d1c83b236565a71e3d005a71#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.eventLog.rolling.compaction.score.threshold | 3.0.0 | SPARK-30481 | a2fe73b83c0e7c61d1c83b236565a71e3d005a71#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.enabled | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.interval | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.maxAge | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.ui.acls.enable | 1.0.1 | SPARK-1489 | c8dd13221215275948b1a6913192d40e0c8cbadd#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.ui.admin.acls | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.ui.admin.acls.groups | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.numReplayThreads | 2.0.0 | SPARK-13988 | 6fdd0e32a6c3fdce1f3f7e1f8d252af05c419f7b#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.retainedApplications | 1.0.0 | SPARK-1276 | 9ae80bf9bd3e4da7443af97b41fe26aa5d35d70b#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.provider | 1.1.0 | SPARK-1768 | 21ddd7d1e9f8e2a726427f32422c31706a20ba3f#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.kerberos.enabled | 1.0.1 | SPARK-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.kerberos.principal | 1.0.1 | SPARK-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.kerberos.keytab | 1.0.1 | SPARK-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.custom.executor.log.url | 3.0.0 | SPARK-26311 | ae5b2a6a92be4986ef5b8062d7fb59318cff6430#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.custom.executor.log.url.applyIncompleteApplication | 3.0.0 | SPARK-26311 | ae5b2a6a92be4986ef5b8062d7fb59318cff6430#diff-6bddeb5e25239974fc13db66266b167b |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27751 from beliefer/add-version-to-history-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:15:49 +09:00
beliefer 3beb4f875d [SPARK-30908][CORE][DOC] Add version information to the configuration of Kryo
### What changes were proposed in this pull request?
1. Add version information to the configuration of `Kryo`.
2. Update the docs of `Kryo`.

I sorted out the information shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.kryo.registrationRequired | 1.1.0 | SPARK-2102 | efdaeb111917dd0314f1d00ee8524bed1e2e21ca#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryo.registrator | 0.5.0 | None | 91c07a33d90ab0357e8713507134ecef5c14e28a#diff-792ed56b3398163fa14e8578549d0d98 | This is not a release version; do we need to record it?
spark.kryo.classesToRegister | 1.2.0 | SPARK-1813 | 6bb56faea8d238ea22c2de33db93b1b39f492b3a#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.kryo.unsafe | 2.1.0 | SPARK-928 | bc167a2a53f5a795d089e8a884569b1b3e2cd439#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryo.pool | 3.0.0 | SPARK-26466 | 38f030725c561979ca98b2a6cc7ca6c02a1f80ed#diff-a3c6b992784f9abeb9f3047d3dcf3ed9 |  
spark.kryo.referenceTracking | 0.8.0 | None | 0a8cc309211c62f8824d76618705c817edcf2424#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryoserializer.buffer | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryoserializer.buffer.max | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-1f81c62dad0e2dfc387a974bb08c497c |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27734 from beliefer/add-version-to-kryo-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:14:47 +09:00
Jiaan Geng a429ac83e4 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691
I sorted out the information shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.compression.codec | 2.3.0 | SPARK-21839 | d8f45408635d4fccac557cb1e877dfe9267fb326#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.filterPushdown | 1.4.0 | SPARK-2883 | 65d71bd9fbfe6fe1b741c80fed72d6ae3d22b028#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.orc.mergeSchema | 3.0.0 | SPARK-11412 | 73183b3c8c2022846587f08e8dea5c387ed3b8d5#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.verifyPartitionPath | 1.4.0 | SPARK-5068 | 1f39a61118184e136f38381a9f3ba0b2d5d589d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.metastorePartitionPruning | 1.5.0 | SPARK-9386 | ce89ff477aea6def68265ed218f6105680755c9a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.manageFilesourcePartitions | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.filesourcePartitionFileCacheSize | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.caseSensitiveInferenceMode | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.optimizer.metadataOnly | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.columnNameOfCorruptRecord | 1.2.0 | SPARK-3339 | 1c7f0ab302de9f82b1bd6da852d133823bc67c66#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.scheduler.pool | 1.1.1 | SPARK-3025 | 496f62d9a98067256d8a51fd1e7a485ff6492fa8#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftServer.incrementalCollect | 2.0.3 | SPARK-18857 | c94288b57b5ce2232e58e35cada558d8d5b8ec6e#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.thriftserver.ui.retainedStatements | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.ui.retainedSessions | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.default | 1.3.0 | SPARK-5658 | a21090ebe1ef7a709709300712de7d928a923244#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.convertCTAS | 2.0.0 | SPARK-15646 | 5a835b99f9852b0c2a35f9c75a51d493474994ea#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.hive.gatherFastStats | 2.0.1 | SPARK-17063 | 3d283f6c9d9daef53fa4e90b0ead2a94710a37a7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.partitionColumnTypeInference.enabled | 1.5.0 | SPARK-7939 | 03ef6be9ce61a13dcd9d8c71298fb4be39119411#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.bucketing.enabled | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.bucketing.maxBuckets | 2.4.0 | SPARK-23997 | de46df549acee7fda56bb0871f444d2f3b49e582#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.crossJoin.enabled | 2.0.0 | SPARK-15425 | 4462da7071462084c5b55cc414c7faa0e1396a18#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.orderByOrdinal | 2.0.0 | SPARK-12789 | 2c5b18fb0fdeabd378dd97e91f72d1eac4e21cc7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByOrdinal | 2.0.0 | SPARK-13957 | 05f652d6c2bbd764a1dd5a45301811e14519486f#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByAliases | 2.2.0 | SPARK-14471 | af3a1411a28796d4d9a100eefb093b1d91532754#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.outputCommitterClass | 1.4.0 | SPARK-7567 | a385f4b8dd22e0e056569cffc4fa63047cb7c8f2#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.commitProtocolClass | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.parallelPartitionDiscovery.threshold | 1.5.0 | SPARK-8125 | a1064df0ee3daf496800be84293345a10e1497d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.parallelPartitionDiscovery.parallelism | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.ignoreDataLocality | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.selfJoinAutoResolveAmbiguity | 1.4.0 | SPARK-6231 | e61083ccab7764d1929248490a3d2e83987241e0#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.analyzer.failAmbiguousSelfJoin | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.retainGroupColumns | 1.4.0 | SPARK-7462 | 9c35f02b35fda80d6558573466735e79b3dd9124#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.pivotMaxValues | 1.6.0 | SPARK-8992 | 5940fc71d2a245cc6e50edb455c3dd3dbb8de43a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.runSQLOnFiles | 1.6.0 | SPARK-11197 | f8c6bec65784de89b47e96a367d3f9790c1b3115#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.codegen.wholeStage | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.useIdInClassName | 2.3.1 | SPARK-23032 | 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.maxFields | 2.0.0 | SPARK-14224 and SPARK-14223 and SPARK-14310 | 5a4b11a901703464b9261dea0642d80cf8d4856c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.factoryMode | 2.4.0 | SPARK-23711 | a40ffc656d62372da85e0fa932b67207839e7fde#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.fallback | 2.0.0 | SPARK-15759 | f0fa0a8946fb4bdf0f4697a8e389f49e98422871#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.logging.maxLines | 2.3.0 | SPARK-20871 | 2a53fbfce72b3faef020e39a1e8628d68bc95beb#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.hugeMethodLimit | 2.3.0 | SPARK-21871 | 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.methodSplitThreshold | 3.0.0 | SPARK-25850 | e017cb39642a5039abd8ce8127ad41712901bdbc#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.splitConsumeFuncByOperator | 2.3.1 | SPARK-21717 | c79e771f8952e6773c3a84cc617145216feddbcf#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.ignoreCorruptFiles | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.ignoreMissingFiles | 2.3.0 | SPARK-22366 | 8e9863531bebbd4d83eafcbc2b359b8bd0ac5734#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxRecordsPerFile | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.exchange.reuse | 2.0.0 | SPARK-13523 | 3dc9ae2e158e5b51df6f799767946fe1d190156b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.execution.reuseSubquery | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.providerClass | 2.3.0 | SPARK-20883 and SPARK-20376 | fa757ee1d41396ad8734a3f2dd045bb09bc82a2e#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.minDeltasForSnapshot | 2.0.0 | SPARK-13809 | 8c826880f5eaa3221c4e9e7d3fece54e821a0b98#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion | 2.4.0 | SPARK-22187 | b3d88ac02940eff4c867d3acb79fe5ff9d724e83#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.checkpointLocation | 2.0.0 | SPARK-13985 | caea15214571d9b12dcf1553e5c1cc8b83a8ba5b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.forceDeleteTempCheckpointLocation | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.minBatchesToRetain | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.maxBatchesToRetainInMemory | 2.4.0 | SPARK-24717 | 8b7d4f842fdc90b8d1c37080bdd9b5e1d070f5c0#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.aggregation.stateFormatVersion | 2.4.0 | SPARK-24763 | 6c5cb85856235efd464b109558896f81ae2c4c75#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stopActiveRunOnRestart | 3.0.0 | SPARK-29568 | 363af16c72abe19fc5cc5b5bdf9d8dc34975f2ba#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.join.stateFormatVersion | 3.0.0 | SPARK-26154 | c941362cb94b24bdf48d4928a1a4dff1b13a1484#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.unsupportedOperationCheck | 2.0.0 | SPARK-14473 | 775cf17eaaae1a38efe47b282b1d6bbdb99bd759#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.variable.substitute | 2.0.0 | SPARK-14769 | 334c293ec0bcc2195d502c574ca40dbc4769d666#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.aggregate.map.twolevel.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.map.vectorized.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.splitAggregateFunc.enabled | 3.0.0 | SPARK-21870 | cb0cddffe9452937033e0e6b1fc0e600d2c787ad#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.view.maxNestedViewDepth | 2.2.0 | SPARK-19877 | ee36bc1c9043ead3c3ba4fba7e68c6c47ad7ae7a#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.commitProtocolClass | 2.1.0 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.multipleWatermarkPolicy | 2.4.0 | SPARK-24730 | 6078b891da8fe7fc36579699473168ae7443284c#diff-9a6b543db706f1a90f790783d6930a13

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27730 from beliefer/add-version-to-sql-config-part-two.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:13:58 +09:00
Huaxin Gao 0a5e9a12a7 [SPARK-30995][ML][DOCS] Latex doesn't work correctly in FMClassifier/FMRegressor Scala doc
### What changes were proposed in this pull request?
LaTeX doesn't render correctly in the FMClassifier/FMRegressor Scala docs.

### Why are the changes needed?
Fix the docs so the LaTeX renders correctly.

### Does this PR introduce any user-facing change?
Before fix:

![image](https://user-images.githubusercontent.com/13592258/75611743-0fa00a00-5ad2-11ea-9cc0-4b246b25d63d.png)

![image](https://user-images.githubusercontent.com/13592258/75611755-25adca80-5ad2-11ea-884d-b2792b714bd5.png)

After fix:

![image](https://user-images.githubusercontent.com/13592258/75611776-46762000-5ad2-11ea-838d-f7f6f93c8aec.png)

![image](https://user-images.githubusercontent.com/13592258/75611778-51c94b80-5ad2-11ea-85e4-8c7424268f52.png)

### How was this patch tested?
Manually built the docs and checked the rendering.

Closes #27748 from huaxingao/fm_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:33:26 +09:00
Maxim Gekk f828453e95 [SPARK-30988][SQL][TESTS] Add more edge-case exercising values to stats tests
### What changes were proposed in this pull request?
Added more test cases to `StatisticsCollectionTestBase`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `StatisticsSuite` and `StatisticsCollectionSuite`.

Closes #27741 from MaxGekk/stat-collect-tests.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:30:00 +09:00
Josh Rosen f4499f678d [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
### What changes were proposed in this pull request?

This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used from multiple threads, createDataset's use of the encoder may produce incorrect or corrupt results, because the Encoder's internal mutable state is updated from multiple threads.

Here is an example demonstrating the problem:

```scala
import org.apache.spark.sql._

val enc = implicitly[Encoder[(Int, Int)]]

val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach {
  pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```

Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.

This bug is similar to SPARK-22355 / #19577, a similar problem in `Dataset.collect()`.

The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same [code comment](d841b33ba3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (L3414)) / explanation as that PR.
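
For reference, on releases without this fix, one user-side mitigation is to resolve a fresh `Encoder` inside each thread instead of sharing a single instance. This is only a hedged sketch of that workaround, not the fix itself (which copies rows inside `createDataset`):
```scala
import org.apache.spark.sql.{Encoder, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()
import spark.implicits._

val datasets = (1 to 100).par.map { _ =>
  // Resolve a fresh encoder per thread, so no mutable encoder state is shared.
  val enc = implicitly[Encoder[(Int, Int)]]
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach { pair =>
  require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```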

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tested manually using the example listed above.

Thanks to smcnamara-stripe for identifying this bug.

Closes #26076 from JoshRosen/SPARK-29419.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:19:12 +09:00
iRakson 92a5ae2ae4 [SPARK-30234][SQL][FOLLOWUP] Rename spark.sql.legacy.addDirectory.recursive.enabled to spark.sql.legacy.addSingleFileInAddFile
### What changes were proposed in this pull request?
Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile`

### Why are the changes needed?
To follow the naming convention

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27725 from iRakson/SPARK-30234_CONFIG.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-01 10:55:41 +09:00
Thomas Graves 0e2ca11d80 [SPARK-29149][YARN] Update YARN cluster manager For Stage Level Scheduling
### What changes were proposed in this pull request?

YARN-side changes for stage-level scheduling. The previous PR, for the dynamic allocation changes, was https://github.com/apache/spark/pull/27313

Modified the data structures to store things on a per-ResourceProfile basis.
I tried to keep the code changes to a minimum: the main request loop now just goes through each ResourceProfile, and the logic inside it for each profile stays very close to what it was before.
On submission we now have to give each ResourceProfile a separate YARN priority, because YARN doesn't support asking for containers with different resources at the same priority; we simply use the profile id as the priority level.
Using a different priority per profile actually makes it easier, when containers come back, to match them to the ResourceProfile they were requested for.
The expectation is that YARN only gives you a container with the requested resource amounts or more; it should never give you a container that doesn't satisfy your resource requests.
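
Conceptually, the profile-id-to-priority mapping looks like the hedged sketch below (illustrative helper names, not the actual YarnAllocator code):
```scala
import org.apache.hadoop.yarn.api.records.{Container, Priority}

object ProfilePrioritySketch {
  // Each ResourceProfile gets its own YARN priority; the profile id is used directly.
  def priorityFor(resourceProfileId: Int): Priority = Priority.newInstance(resourceProfileId)

  // When a container comes back, its priority identifies the ResourceProfile it was requested for.
  def resourceProfileIdFor(container: Container): Int = container.getPriority.getPriority
}
```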

If you want to see the full feature changes you can look at https://github.com/apache/spark/pull/27053/files for reference

### Why are the changes needed?

To add YARN support for stage-level scheduling.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Tested manually on a YARN cluster, plus unit tests.

Closes #27583 from tgravescs/SPARK-29149.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-02-28 15:23:33 -06:00
Thomas Graves 6c0c41fa0d [SPARK-30987][CORE] Increase the timeout on local-cluster waitUntilExecutorsUp calls
### What changes were proposed in this pull request?

The ResourceDiscoveryPlugin tests intermittently time out, and they do so while just bringing up the local-cluster. I am not able to reproduce this locally; I suspect the Jenkins boxes are overloaded and take longer than 10 seconds. Another JIRA, SPARK-29139, already increased the timeout for some of these calls, so this change increases the timeout to 60 seconds.

Examples of timeouts:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119030/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119005/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119029/testReport/
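
The change itself amounts to passing a larger timeout to the wait call, roughly as in this hedged sketch (the exact helper name and signature are assumed here):
```scala
import org.apache.spark.{SparkConf, SparkContext, TestUtils}

// Hedged sketch: bring up a 3-executor local-cluster and wait up to 60s (was 10s) for executors.
val conf = new SparkConf()
  .setMaster("local-cluster[3, 1, 1024]")
  .setAppName("executors-up-sketch")
val sc = new SparkContext(conf)
try {
  TestUtils.waitUntilExecutorsUp(sc, 3, 60000) // assumed signature: (sc, numExecutors, timeoutMs)
} finally {
  sc.stop()
}
```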

### Why are the changes needed?

Tests should no longer fail intermittently.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Unit tests ran.

Closes #27738 from tgravescs/SPARK-30987.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-02-28 11:43:05 -08:00
Huaxin Gao 961c539a67 [SPARK-28998][SQL][FOLLOW-UP] Remove unnecessary MiMa excludes
### What changes were proposed in this pull request?
Remove the cases for ```MissingTypesProblem```, ```InheritedNewAbstractMethodProblem```, ```DirectMissingMethodProblem``` and ```ReversedMissingMethodProblem```.

### Why are the changes needed?
After the changes, we no longer have ```org.apache.spark.sql.sources.v2```, so the only problem type we can hit is ```MissingClassProblem```.
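
For context, the remaining kind of entry in `project/MimaExcludes.scala` looks roughly like the hedged sketch below (the class name is just an illustrative example of a removed v2 class):
```scala
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  // Only MissingClassProblem filters are needed once the whole sources.v2 package is gone.
  val v2Removals = Seq(
    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.DataSourceV2")
  )
}
```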

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested

Closes #27731 from huaxingao/spark-28998-followup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-28 11:22:08 -08:00
yi.wu 3b69796a89 [SPARK-30947][CORE] Log better message when accelerate resource is empty
### What changes were proposed in this pull request?

Log a better message when no accelerator resources are specified.

### Why are the changes needed?

Otherwise, it's confusing to see CPU/memory resources logged right after **that** empty resources block:

```
20/02/25 21:47:55 INFO ResourceUtils: ==============================================================
20/02/25 21:47:55 INFO ResourceUtils: Resources for spark.driver:

20/02/25 21:47:55 INFO ResourceUtils: ==============================================================
20/02/25 21:47:55 INFO SparkContext: Submitted application: Spark shell
20/02/25 21:47:55 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
20/02/25 21:47:55 INFO ResourceProfile: Limiting resource is  at -1 tasks per executor
```
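
For comparison, the driver resources block is only populated when accelerator resources are actually configured, e.g. in this hedged sketch (the discovery script path is hypothetical):
```scala
import org.apache.spark.SparkConf

// Requesting one GPU for the driver; with this set, the "Resources for spark.driver"
// block above would list the GPU instead of being empty.
val conf = new SparkConf()
  .set("spark.driver.resource.gpu.amount", "1")
  .set("spark.driver.resource.gpu.discoveryScript", "/opt/spark/scripts/getGpus.sh") // hypothetical path
```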

### Does this PR introduce any user-facing change?

NO.

### How was this patch tested?

Tested manually.

Closes #27693 from Ngone51/dont_log_resource.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-02-28 10:54:38 -08:00
iRakson a40a2f8338 [SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType'
### What changes were proposed in this pull request?
Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`.
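
Usage with the renamed flag, as a hedged example (per SPARK-27619, hashing map types fails analysis unless this legacy flag is enabled):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hash-map-flag").getOrCreate()

// With the default (false), hash() on a MapType argument fails analysis;
// the renamed legacy flag restores the old behavior.
spark.conf.set("spark.sql.legacy.allowHashOnMapType", "true")
spark.sql("SELECT hash(map(1, 2)) AS h").show()
```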

### Why are the changes needed?
Better readability of the configuration name.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27719 from iRakson/SPARK-27619_FOLLOWUP.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 22:57:50 +08:00