Commit graph

27485 commits

Author SHA1 Message Date
Sean Owen 97d9a22b04 [SPARK-30994][CORE] Update xerces to 2.12.0
### What changes were proposed in this pull request?

Manage up the version of Xerces that Hadoop uses (and potentially user apps) to 2.12.0 to match https://issues.apache.org/jira/browse/HADOOP-16530

### Why are the changes needed?

Picks up bug and security fixes: https://www.xml.com/news/2018-05-apache-xerces-j-2120/

### Does this PR introduce any user-facing change?

Should be no behavior changes.

### How was this patch tested?

Existing tests.

Closes #27746 from srowen/SPARK-30994.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-03 09:27:18 -06:00
roland-ondeviceresearch a4aaee01fa [MINOR][DOCS] ForeachBatch java example fix
### What changes were proposed in this pull request?
ForEachBatch Java example was incorrect

### Why are the changes needed?
Example did not compile

### Does this PR introduce any user-facing change?
Yes, to docs.

### How was this patch tested?
In IDE.

Closes #27740 from roland1982/foreachwriter_java_example_fix.

Authored-by: roland-ondeviceresearch <roland@ondeviceresearch.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-03 09:24:33 -06:00
Gengliang Wang b5166aac1f [SPARK-31013][CORE][WEBUI] InMemoryStore: improve removeAllByIndexValues over natural key index
### What changes were proposed in this pull request?

The method `removeAllByIndexValues` in KVStore is to delete all the objects which have certain values in the given index.
However, in the current implementation of `InMemoryStore`, when the given index is the natural key index, there is no special handling for it and a linear search over all the task data is performed.

We can improve it by deleting the natural keys directly from the internal hashmap.

### Why are the changes needed?

Better performance if the given index for `removeAllByIndexValues` is the natural key index in
`InMemoryStore`
### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Enhance the existing test.

Closes #27763 from gengliangwang/useNaturalIndex.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-03 19:34:19 +08:00
Takeshi Yamamuro 313e62c376 [SPARK-30998][SQL] ClassCastException when a generator having nested inner generators
### What changes were proposed in this pull request?

A query below failed in the master;

```
scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show()
20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    ...
```

This pr modified the `hasNestedGenerator` code in `ExtractGenerator` for correctly catching nested inner generators.

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27750 from maropu/HandleNestedGenerators.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-03 19:00:33 +09:00
Kent Yao 1fac06c430 Revert "[SPARK-30808][SQL] Enable Java 8 time API in Thrift server"
This reverts commit afaeb29599.

### What changes were proposed in this pull request?

Based on the result and comment from https://github.com/apache/spark/pull/27552#discussion_r385531744

In the hive module, server-side provides datetime values simply use `value.toSting`, and the client-side regenerates the results back in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`.
there will be inconsistency between client and server if we use java8 APIs

### Why are the changes needed?

the change is still unclear enough

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Nah

Closes #27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-03 14:21:20 +08:00
HyukjinKwon 3956e95f05 [SPARK-25202][SQL][FOLLOW-UP] Keep the old parameter name 'pattern' at split in Scala API
### What changes were proposed in this pull request?

To address the concern pointed out in https://github.com/apache/spark/pull/22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-03 10:24:50 +09:00
maryannxue 473a28c1d0 [SPARK-30991] Refactor AQE readers and RDDs
### What changes were proposed in this pull request?
This PR combines `CustomShuffledRowRDD` and `LocalShuffledRowRDD` into `ShuffledRowRDD`, and creates `CustomShuffleReaderExec` to unify and replace all existing AQE readers: `CoalescedShuffleReaderExec`, `LocalShuffleReaderExec` and `SkewJoinShuffleReaderExec`.

### Why are the changes needed?
To reduce code redundancy.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Passed existing UTs.

Closes #27742 from maryannxue/aqe-readers.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-02 16:04:00 -08:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()`then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
yi.wu b517f991fe [SPARK-30969][CORE] Remove resource coordination support from Standalone
### What changes were proposed in this pull request?

Remove automatically resource coordination support from Standalone.

### Why are the changes needed?

Resource coordination is mainly designed for the scenario where multiple workers launched on the same host. However, it's, actually, a non-existed  scenario for today's Spark. Because, Spark now can start multiple executors in a single Worker, while it only allow one executor per Worker at very beginning. So, now, it really help nothing for user to launch multiple workers on the same host. Thus, it's not worth for us to bring over complicated implementation and potential high maintain cost for such an impossible scenario.

### Does this PR introduce any user-facing change?

No, it's Spark 3.0 feature.

### How was this patch tested?

Pass Jenkins.

Closes #27722 from Ngone51/abandon_coordination.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-03-02 11:23:07 -08:00
Wu, Xiaochang ac122762f5 [SPARK-30813][ML] Fix Matrices.sprand comments
### What changes were proposed in this pull request?
Fix mistakes in comments

### Why are the changes needed?
There are mistakes in comments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27564 from xwu99/fix-mllib-sprand-comment.

Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-02 08:56:17 -06:00
Jungtaek Lim (HeartSaVioR) f24a46011c [SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable
### What changes were proposed in this pull request?

This patch fixes the bug of UnsafeRow which misses to handle the UDT specifically, in `isFixedLength` and `isMutable`. These methods don't check its SQL type for UDT, always treating UDT as variable-length, and non-mutable.

It doesn't bring any issue if UDT is used to represent complicated type, but when UDT is used to represent some type which is matched with fixed length of SQL type, it exposes the chance of correctness issues, as these informations sometimes decide how the value should be handled.

We got report from user mailing list which suspected as mapGroupsWithState looks like handling UDT incorrectly, but after some investigation it was from GenerateUnsafeRowJoiner in shuffle phase.

0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)

Here updating position should not happen on fixed-length column, but due to this bug, the value of UDT having fixed-length as sql type would be modified, which actually corrupts the value.

### Why are the changes needed?

Misclassifying of the type of length for UDT can corrupt the value when the row is presented to the input of GenerateUnsafeRowJoiner, which brings correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #27747 from HeartSaVioR/SPARK-30993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 22:33:11 +08:00
Gengliang Wang 6b641430c3 [SPARK-30964][CORE][WEBUI] Accelerate InMemoryStore with a new index
### What changes were proposed in this pull request?

Spark uses the class `InMemoryStore` as the KV storage for live UI and history server(by default if no LevelDB file path is provided).
In `InMemoryStore`, all the task data in one application is stored in a hashmap, which key is the task ID and the value is the task data. This fine for getting or deleting with a provided task ID.
However, Spark stage UI always shows all the task data in one stage and the current implementation is to look up all the values in the hashmap. The time complexity is O(numOfTasks).
Also, when there are too many stages (>spark.ui.retainedStages), Spark will linearly try to look up all the task data of the stages to be deleted as well.

This can be very bad for a large application with many stages and tasks. We can improve it by allowing the natural key of an entity to have a real parent index. So that on each lookup with parent node provided, Spark can look up all the natural keys(in our case, the task IDs) first, and then find the data with the natural keys in the hashmap.

### Why are the changes needed?

The in-memory KV store becomes really slow for large applications. We can improve it with a new index. The performance can be 10 times, 100 times, even 1000 times faster.
This is also possible to make the Spark driver more stable for large applications.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests.
Also, I run a benchmark with the following code
```
  val store = new InMemoryStore()
  val numberOfTasksPerStage = 10000
   (0 until 1000).map { sId =>
     (0 until numberOfTasksPerStage).map { taskId =>
       val task = newTaskData(sId * numberOfTasksPerStage + taskId, "SUCCESS", sId)
       store.write(task)
     }
   }
  val appStatusStore = new AppStatusStore(store)
  var start = System.nanoTime()
  appStatusStore.taskSummary(2, attemptId, Array(0, 0.25, 0.5, 0.75, 1))
  println("task summary run time: " + ((System.nanoTime() - start) / 1000000))
  val stageIds = Seq(1, 11, 66, 88)
  val stageKeys = stageIds.map(Array(_, attemptId))
  start = System.nanoTime()
  store.removeAllByIndexValues(classOf[TaskDataWrapper], TaskIndexNames.STAGE,
    stageKeys.asJavaCollection)
   println("clean up tasks run time: " + ((System.nanoTime() - start) / 1000000))
```

Task summary before the changes: 98642ms
Task summary after the changes: 120ms

Task clean up before the changes:  4900ms
Task clean up before the changes: 4ms

It's 800x faster after the changes in the micro-benchmark.

Closes #27716 from gengliangwang/liveUIStore.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 15:48:48 +08:00
beliefer 71365c2502 [SPARK-30912][CORE][DOC] Add version information to the configuration of Streaming.scala
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Streaming`.
2.Update the docs of `Streaming`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.streaming.dynamicAllocation.enabled | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.testing | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.minExecutors | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.maxExecutors | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingInterval | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingUpRatio | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |
spark.streaming.dynamicAllocation.scalingDownRatio | 3.0.0 | SPARK-26941 | cad475dcc9376557f882859856286e858002389a#diff-335c3bbf4ca27cf65a6f850b1a7c89dd |

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27745 from beliefer/add-version-to-streaming-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:16:40 +09:00
beliefer c63366a693 [SPARK-30891][CORE][DOC] Add version information to the configuration of History
### What changes were proposed in this pull request?
1.Add version information to the configuration of `History`.
2.Update the docs of `History`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.history.fs.logDirectory | 1.1.0 | SPARK-1768 | 21ddd7d1e9f8e2a726427f32422c31706a20ba3f#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.safemodeCheck.interval | 1.6.0 | SPARK-11020 | cf04fdfe71abc395163a625cc1f99ec5e54cc07e#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.update.interval | 1.4.0 | SPARK-6046 | 4527761bcd6501c362baf2780905a0018b9a74ba#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.cleaner.enabled | 1.3.0 | SPARK-3562 | 8942b522d8a3269a2a357e3a274ed4b3e66ebdde#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e | Branch branch-1.3 does not exist, exists in branch-1.4, but it is 1.3.0-SNAPSHOT in pom.xml
spark.history.fs.cleaner.interval | 1.4.0 | SPARK-5933 | 1991337336596f94698e79c2366f065c374128ab#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |
spark.history.fs.cleaner.maxAge | 1.4.0 | SPARK-5933 | 1991337336596f94698e79c2366f065c374128ab#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |
spark.history.fs.cleaner.maxNum | 3.0.0 | SPARK-28294 | bbc2be4f425c4c26450e1bf21db407e81046ce21#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.store.path | 2.3.0 | SPARK-20642 | 74daf622de4e534d5a5929b424a6e836850eefad#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.store.maxDiskUsage | 2.3.0 | SPARK-20654 | 8b497046c647a21bbed1bdfbdcb176745a1d5cd5#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.ui.port | 1.0.0 | SPARK-1276 | 9ae80bf9bd3e4da7443af97b41fe26aa5d35d70b#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.fs.inProgressOptimization.enabled | 2.4.0 | SPARK-6951 | 653fe02415a537299e15f92b56045569864b6183#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.fs.endEventReparseChunkSize | 2.4.0 | SPARK-6951 | 653fe02415a537299e15f92b56045569864b6183#diff-19f35f981fdc5b0a46f070b879a9a9fc |  
spark.history.fs.eventLog.rolling.maxFilesToRetain | 3.0.0 | SPARK-30481 | a2fe73b83c0e7c61d1c83b236565a71e3d005a71#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.eventLog.rolling.compaction.score.threshold | 3.0.0 | SPARK-30481 | a2fe73b83c0e7c61d1c83b236565a71e3d005a71#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.enabled | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.interval | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.fs.driverlog.cleaner.maxAge | 3.0.0 | SPARK-25118 | 5f11e8c4cb9a5db037ac239b8fcc97f3a746e772#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.ui.acls.enable | 1.0.1 | Spark 1489 | c8dd13221215275948b1a6913192d40e0c8cbadd#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.ui.admin.acls | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.ui.admin.acls.groups | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.fs.numReplayThreads | 2.0.0 | SPARK-13988 | 6fdd0e32a6c3fdce1f3f7e1f8d252af05c419f7b#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.retainedApplications | 1.0.0 | SPARK-1276 | 9ae80bf9bd3e4da7443af97b41fe26aa5d35d70b#diff-b49b5b9c31ddb36a9061004b5b723058 |
spark.history.provider | 1.1.0 | SPARK-1768 | 21ddd7d1e9f8e2a726427f32422c31706a20ba3f#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.kerberos.enabled | 1.0.1 | Spark-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.kerberos.principal | 1.0.1 | Spark-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.kerberos.keytab | 1.0.1 | Spark-1490 | 866b03ef4d27b2160563b58d577de29ba6eb4442#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.custom.executor.log.url | 3.0.0 | SPARK-26311 | ae5b2a6a92be4986ef5b8062d7fb59318cff6430#diff-6bddeb5e25239974fc13db66266b167b |  
spark.history.custom.executor.log.url.applyIncompleteApplication | 3.0.0 | SPARK-26311 | ae5b2a6a92be4986ef5b8062d7fb59318cff6430#diff-6bddeb5e25239974fc13db66266b167b |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27751 from beliefer/add-version-to-history-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:15:49 +09:00
beliefer 3beb4f875d [SPARK-30908][CORE][DOC] Add version information to the configuration of Kryo
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Kryo`.
2.Update the docs of `Kryo`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.kryo.registrationRequired | 1.1.0 | SPARK-2102 | efdaeb111917dd0314f1d00ee8524bed1e2e21ca#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryo.registrator | 0.5.0 | None | 91c07a33d90ab0357e8713507134ecef5c14e28a#diff-792ed56b3398163fa14e8578549d0d98 | This is not a release version, do we need to record it?
spark.kryo.classesToRegister | 1.2.0 | SPARK-1813 | 6bb56faea8d238ea22c2de33db93b1b39f492b3a#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.kryo.unsafe | 2.1.0 | SPARK-928 | bc167a2a53f5a795d089e8a884569b1b3e2cd439#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryo.pool | 3.0.0 | SPARK-26466 | 38f030725c561979ca98b2a6cc7ca6c02a1f80ed#diff-a3c6b992784f9abeb9f3047d3dcf3ed9 |  
spark.kryo.referenceTracking | 0.8.0 | None | 0a8cc309211c62f8824d76618705c817edcf2424#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryoserializer.buffer | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-1f81c62dad0e2dfc387a974bb08c497c |  
spark.kryoserializer.buffer.max | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-1f81c62dad0e2dfc387a974bb08c497c |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27734 from beliefer/add-version-to-kryo-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:14:47 +09:00
Jiaan Geng a429ac83e4 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.compression.codec | 2.3.0 | SPARK-21839 | d8f45408635d4fccac557cb1e877dfe9267fb326#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.filterPushdown | 1.4.0 | SPARK-2883 | 65d71bd9fbfe6fe1b741c80fed72d6ae3d22b028#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.orc.mergeSchema | 3.0.0 | SPARK-11412 | 73183b3c8c2022846587f08e8dea5c387ed3b8d5#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.verifyPartitionPath | 1.4.0 | Spark-5068 | 1f39a61118184e136f38381a9f3ba0b2d5d589d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.metastorePartitionPruning | 1.5.0 | SPARK-9386 | ce89ff477aea6def68265ed218f6105680755c9a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.manageFilesourcePartitions | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.filesourcePartitionFileCacheSize | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.caseSensitiveInferenceMode | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.optimizer.metadataOnly | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.columnNameOfCorruptRecord | 1.2.0 | SPARK-3339 | 1c7f0ab302de9f82b1bd6da852d133823bc67c66#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.scheduler.pool | 1.1.1 | SPARK-3025 | 496f62d9a98067256d8a51fd1e7a485ff6492fa8#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftServer.incrementalCollect | 2.0.3 | SPARK-18857 | c94288b57b5ce2232e58e35cada558d8d5b8ec6e#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.thriftserver.ui.retainedStatements | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.ui.retainedSessions | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.default | 1.3.0 | SPARK-5658 | a21090ebe1ef7a709709300712de7d928a923244#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.convertCTAS | 2.0.0 | SPARK-15646 | 5a835b99f9852b0c2a35f9c75a51d493474994ea#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.hive.gatherFastStats | 2.0.1 | SPARK-17063 | 3d283f6c9d9daef53fa4e90b0ead2a94710a37a7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.partitionColumnTypeInference.enabled | 1.5.0 | SPARK-7939 | 03ef6be9ce61a13dcd9d8c71298fb4be39119411#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.bucketing.enabled | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.bucketing.maxBuckets | 2.4.0 | SPARK-23997 | de46df549acee7fda56bb0871f444d2f3b49e582#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.crossJoin.enabled | 2.0.0 | SPARK-15425 | 4462da7071462084c5b55cc414c7faa0e1396a18#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.orderByOrdinal | 2.0.0 | SPARK-12789 | 2c5b18fb0fdeabd378dd97e91f72d1eac4e21cc7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByOrdinal | 2.0.0 | SPARK-13957 | 05f652d6c2bbd764a1dd5a45301811e14519486f#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByAliases | 2.2.0 | SPARK-14471 | af3a1411a28796d4d9a100eefb093b1d91532754#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.outputCommitterClass | 1.4.0 | SPARK-7567 | a385f4b8dd22e0e056569cffc4fa63047cb7c8f2#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.commitProtocolClass | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.parallelPartitionDiscovery.threshold | 1.5.0 | SPARK-8125 | a1064df0ee3daf496800be84293345a10e1497d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.parallelPartitionDiscovery.parallelism | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.ignoreDataLocality | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.selfJoinAutoResolveAmbiguity | 1.4.0 | SPARK-6231 | e61083ccab7764d1929248490a3d2e83987241e0#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.analyzer.failAmbiguousSelfJoin | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.retainGroupColumns | 1.4.0 | SPARK-7462 | 9c35f02b35fda80d6558573466735e79b3dd9124#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.pivotMaxValues | 1.6.0 | SPARK-8992 | 5940fc71d2a245cc6e50edb455c3dd3dbb8de43a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.runSQLOnFiles | 1.6.0 | SPARK-11197 | f8c6bec65784de89b47e96a367d3f9790c1b3115#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.codegen.wholeStage | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.useIdInClassName | 2.3.1 | SPARK-23032 | 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.maxFields | 2.0.0 | SPARK-14224 and SPARK-14223 and SPARK-14310 | 5a4b11a901703464b9261dea0642d80cf8d4856c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.factoryMode | 2.4.0 | SPARK-23711 | a40ffc656d62372da85e0fa932b67207839e7fde#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.fallback | 2.0.0 | SPARK-15759 | f0fa0a8946fb4bdf0f4697a8e389f49e98422871#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.logging.maxLines | 2.3.0 | SPARK-20871 | 2a53fbfce72b3faef020e39a1e8628d68bc95beb#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.hugeMethodLimit | 2.3.0 | SPARK-21871 | 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.methodSplitThreshold | 3.0.0 | SPARK-25850 | e017cb39642a5039abd8ce8127ad41712901bdbc#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.splitConsumeFuncByOperator | 2.3.1 | SPARK-21717 | c79e771f8952e6773c3a84cc617145216feddbcf#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.ignoreCorruptFiles | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.ignoreMissingFiles | 2.3.0 | SPARK-22366 | 8e9863531bebbd4d83eafcbc2b359b8bd0ac5734#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxRecordsPerFile | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.exchange.reuse | 2.0.0 | SPARK-13523 | 3dc9ae2e158e5b51df6f799767946fe1d190156b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.execution.reuseSubquery | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.providerClass | 2.3.0 | SPARK-20883 and SPARK-20376 | fa757ee1d41396ad8734a3f2dd045bb09bc82a2e#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.minDeltasForSnapshot | 2.0.0 | SPARK-13809 | 8c826880f5eaa3221c4e9e7d3fece54e821a0b98#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion | 2.4.0 | SPARK-22187 | b3d88ac02940eff4c867d3acb79fe5ff9d724e83#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.checkpointLocation | 2.0.0 | SPARK-13985 | caea15214571d9b12dcf1553e5c1cc8b83a8ba5b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.forceDeleteTempCheckpointLocation | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.minBatchesToRetain | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.maxBatchesToRetainInMemory | 2.4.0 | SPARK-24717 | 8b7d4f842fdc90b8d1c37080bdd9b5e1d070f5c0#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.aggregation.stateFormatVersion | 2.4.0 | SPARK-24763 | 6c5cb85856235efd464b109558896f81ae2c4c75#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stopActiveRunOnRestart | 3.0.0 | SPARK-29568 | 363af16c72abe19fc5cc5b5bdf9d8dc34975f2ba#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.join.stateFormatVersion | 3.0.0 | SPARK-26154 | c941362cb94b24bdf48d4928a1a4dff1b13a1484#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.unsupportedOperationCheck | 2.0.0 | SPARK-14473 | 775cf17eaaae1a38efe47b282b1d6bbdb99bd759#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.variable.substitute | 2.0.0 | SPARK-14769 | 334c293ec0bcc2195d502c574ca40dbc4769d666#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.aggregate.map.twolevel.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.map.vectorized.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.splitAggregateFunc.enabled | 3.0.0 | SPARK-21870 | cb0cddffe9452937033e0e6b1fc0e600d2c787ad#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.view.maxNestedViewDepth | 2.2.0 | SPARK-19877 | ee36bc1c9043ead3c3ba4fba7e68c6c47ad7ae7a#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.commitProtocolClass | 2.1.0 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.multipleWatermarkPolicy | 2.4.0 | SPARK-24730 | 6078b891da8fe7fc36579699473168ae7443284c#diff-9a6b543db706f1a90f790783d6930a13

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27730 from beliefer/add-version-to-sql-config-part-two.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:13:58 +09:00
Huaxin Gao 0a5e9a12a7 [SPARK-30995][ML][DOCS] Latex doesn't work correctly in FMClassifier/FMRegressor Scala doc
### What changes were proposed in this pull request?
Latex doesn't work correctly

### Why are the changes needed?
Fix the doc to make Latex work

### Does this PR introduce any user-facing change?
Before fix:

![image](https://user-images.githubusercontent.com/13592258/75611743-0fa00a00-5ad2-11ea-9cc0-4b246b25d63d.png)

![image](https://user-images.githubusercontent.com/13592258/75611755-25adca80-5ad2-11ea-884d-b2792b714bd5.png)

After fix:

![image](https://user-images.githubusercontent.com/13592258/75611776-46762000-5ad2-11ea-838d-f7f6f93c8aec.png)

![image](https://user-images.githubusercontent.com/13592258/75611778-51c94b80-5ad2-11ea-85e4-8c7424268f52.png)

### How was this patch tested?
Manually build doc and test

Closes #27748 from huaxingao/fm_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:33:26 +09:00
Maxim Gekk f828453e95 [SPARK-30988][SQL][TESTS] Add more edge-case exercising values to stats tests
### What changes were proposed in this pull request?
Added more test cases to `StatisticsCollectionTestBase`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `StatisticsSuite` and `StatisticsCollectionSuite`.

Closes #27741 from MaxGekk/stat-collect-tests.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:30:00 +09:00
Josh Rosen f4499f678d [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
### What changes were proposed in this pull request?

This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads then createDataset's usage of the encoder may lead to incorrect / corrupt results because the Encoder's internal mutable state will be updated from multiple threads.

Here is an example demonstrating the problem:

```scala
import org.apache.spark.sql._

val enc = implicitly[Encoder[(Int, Int)]]

val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach {
  pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```

Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.

This bug is similar to SPARK-22355 / #19577, a similar problem in `Dataset.collect()`.

The fix implemented here is based on #24735's updated version of the `Datataset.collect()` bugfix: use `.copy()`. For consistency, I used same [code comment](d841b33ba3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (L3414)) / explanation as that PR.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tested manually using the example listed above.

Thanks to smcnamara-stripe for identifying this bug.

Closes #26076 from JoshRosen/SPARK-29419.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:19:12 +09:00
iRakson 92a5ae2ae4 [SPARK-30234][SQL][FOLLOWUP] Rename spark.sql.legacy.addDirectory.recursive.enabled to spark.sql.legacy.addSingleFileInAddFile
### What changes were proposed in this pull request?
Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile`

### Why are the changes needed?
To follow the naming convention

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27725 from iRakson/SPARK-30234_CONFIG.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-01 10:55:41 +09:00
Thomas Graves 0e2ca11d80 [SPARK-29149][YARN] Update YARN cluster manager For Stage Level Scheduling
### What changes were proposed in this pull request?

Yarn side changes for Stage level scheduling.  The previous PR for dynamic allocation changes was https://github.com/apache/spark/pull/27313

Modified the data structures to store things on a per ResourceProfile basis.
 I tried to keep the code changes to a minimum, the main loop that requests just goes through each Resourceprofile and the logic inside for each one stayed very close to the same.
On submission we now have to give each ResourceProfile a separate yarn Priority because yarn doesn't support asking for containers with different resources at the same Priority. We just use the profile id as the priority level.
Using a different Priority actually makes things easier when the containers come back to match them again which ResourceProfile they were requested for.
The expectation is that yarn will only give you a container with resource amounts you requested or more. It should never give you a container if it doesn't satisfy your resource requests.

If you want to see the full feature changes you can look at https://github.com/apache/spark/pull/27053/files for reference

### Why are the changes needed?

For stage level scheduling YARN support.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Tested manually on YARN cluster and then unit tests.

Closes #27583 from tgravescs/SPARK-29149.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-02-28 15:23:33 -06:00
Thomas Graves 6c0c41fa0d [SPARK-30987][CORE] Increase the timeout on local-cluster waitUntilExecutorsUp calls
### What changes were proposed in this pull request?

The ResourceDiscoveryPlugin tests intermittently timeout. They are timing out on just bringing up the local-cluster. I am not able to reproduce locally.  I suspect the jenkins boxes are overloaded and taking longer then 10 seconds. There was another jira SPARK-29139 that increased timeout for some other of these as well. So try increasing the timeout to 60 seconds.

Examples of timeouts:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119030/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119005/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119029/testReport/

### Why are the changes needed?

tests should no longer intermittently fail.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

unit tests ran.

Closes #27738 from tgravescs/SPARK-30987.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-02-28 11:43:05 -08:00
Huaxin Gao 961c539a67 [SPARK-28998][SQL][FOLLOW-UP] Remove unnecessary MiMa excludes
### What changes were proposed in this pull request?
Remove the cases for ```MissingTypesProblem```, ```InheritedNewAbstractMethodProblem```, ```DirectMissingMethodProblem``` and ```ReversedMissingMethodProblem```.

### Why are the changes needed?
After the changes, we don't have ```org.apache.spark.sql.sources.v2```  any more, so the only problem we can get is ```MissingClassProblem```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested

Closes #27731 from huaxingao/spark-28998-followup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-28 11:22:08 -08:00
yi.wu 3b69796a89 [SPARK-30947][CORE] Log better message when accelerate resource is empty
### What changes were proposed in this pull request?

Try to log better message when accelerate resource is empty.

### Why are the changes needed?

Otherwise, it's weird to see cpu/memory resources after logging **that** resources is empty:

```
20/02/25 21:47:55 INFO ResourceUtils: ==============================================================
20/02/25 21:47:55 INFO ResourceUtils: Resources for spark.driver:

20/02/25 21:47:55 INFO ResourceUtils: ==============================================================
20/02/25 21:47:55 INFO SparkContext: Submitted application: Spark shell
20/02/25 21:47:55 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
20/02/25 21:47:55 INFO ResourceProfile: Limiting resource is  at -1 tasks per executor
```

### Does this PR introduce any user-facing change?

NO.

### How was this patch tested?

Tested manually.

Closes #27693 from Ngone51/dont_log_resource.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-02-28 10:54:38 -08:00
iRakson a40a2f8338 [SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType'
### What changes were proposed in this pull request?
Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`.

### Why are the changes needed?
Better readability of configuration.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27719 from iRakson/SPARK-27619_FOLLOWUP.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 22:57:50 +08:00
Eric Wu eba2076ca3 [SPARK-30842][SQL] Adjust abstraction structure for join operators
### What changes were proposed in this pull request?
Currently the join operators are not well abstracted, since there are lot of common logic. A trait can be created for easier pattern matching and other future handiness. This is a follow-up PR based on comment
https://github.com/apache/spark/pull/27509#discussion_r379613391 .

This PR refined from the following aspects:
1. Refined structure of all physical join operators
2. Add missing joinType field for CartesianProductExec operator
3. Refined codes related to Explain Formatted

The EXPLAIN FORMATTED changes are
1. Converge all join operator `verboseStringWithOperatorId` implementations to `BaseJoinExec`. Join condition displayed, and join keys displayed if it’s not empty.
2. `#1` will add Join condition to `BroadcastNestedLoopJoinExec`.
3. `#1` will **NOT** affect `CartesianProductExec`,`SortMergeJoin` and `HashJoin`s, since they already got there override implementation before.
4. Converge all join operator `simpleStringWithNodeId` to `BaseJoinExec`, which will enhance the one line description for `CartesianProductExec` with `JoinType` added.
5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show `BuildSide`, which was only done for `HashJoin`s before.

### Why are the changes needed?
Make the code consistent with other operators and for future handiness of join operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27595 from Eric5553/RefineJoin.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 18:42:15 +08:00
zhengruifeng 14bb639c55 [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
### What changes were proposed in this pull request?
1, avoid `Iterator.grouped(size: Int)`, which need to maintain an arraybuffer of `size`
2, keep the number of partitions in curve computation

### Why are the changes needed?
1, `BinaryClassificationMetrics` tend to fail (OOM) when `grouping=count/numBins` is too large, due to `Iterator.grouped(size: Int)` need to maintain an arraybuffer with `size` entries, however, in `BinaryClassificationMetrics` we do not need to maintain such a big array;
2, make sizes of partitions more even;

This PR computes metrics more stable and a littler faster;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27682 from zhengruifeng/grouped_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-28 16:55:24 +08:00
Kent Yao 1383bd459a [SPARK-30970][K8S][CORE] Fix NPE while resolving k8s master url
### What changes were proposed in this pull request?

```
bin/spark-sql --master  k8s:///https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
Exception in thread "main" java.lang.NullPointerException
	at org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
Althrough `k8s:///https://kubernetes.docker.internal:6443` is a wrong master url but should not throw npe
The `case null` will never be touched.
3f4060c340/core/src/main/scala/org/apache/spark/util/Utils.scala (L2772-L2776)

### Why are the changes needed?

bug fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

add ut case

Closes #27721 from yaooqinn/SPARK-30970.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-28 00:01:20 -08:00
Huaxin Gao c0d4cc3657 [MINOR][SQL] Remove unnecessary MiMa excludes
### What changes were proposed in this pull request?
I found a few unnecessary MiMa excludes when auditing SQL binary incompatible changes.

### Why are the changes needed?
These MiMa excludes are not required any more, so remove.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested

Closes #27729 from huaxingao/mima.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 15:18:25 +09:00
Wenchen Fan f21894e5fa [SPARK-30902][SQL] Default table provider should be decided by catalog implementations
### What changes were proposed in this pull request?

When `CREATE TABLE` SQL statement does not specify the provider, leave it to the catalog implementations to decide.

### Why are the changes needed?

It's super weird if we set the default provider to parquet when creating a table in a JDBC catalog.

### Does this PR introduce any user-facing change?

Yes, v2 catalog will not see a "provider" property in table properties if it's not specified in `CREATE TABLE` SQL statement. V2 catalog is new in 3.0.

### How was this patch tested?

new tests

Closes #27650 from cloud-fan/create_table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 15:14:23 +09:00
zero323 7de33f56e8 [SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark
### What changes were proposed in this pull request?

This PR add Python API for invoking following higher functions:

- `transform`
- `exists`
- `forall`
- `filter`
- `aggregate`
- `zip_with`
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

to `pyspark.sql`. Each of these accepts plain Python functions of one of the following types

- `(Column) -> Column: ...`
- `(Column, Column) -> Column: ...`
- `(Column, Column, Column) -> Column: ...`

Internally this proposal piggbacks on objects supporting Scala implementation ([SPARK-27297](https://issues.apache.org/jira/browse/SPARK-27297)) by:

1. Creating  required `UnresolvedNamedLambdaVariables`  exposing these as PySpark `Columns`
2. Invoking Python function with these columns as arguments.
3. Using the result, and underlying JVM objects from 1., to create `expressions.LambdaFunction` which is passed to desired expression, and repacked as Python `Column`.

### Why are the changes needed?

Currently higher order functions are available only using SQL and Scala API and can use only SQL expressions

```python
df.selectExpr("transform(values, x -> x + 1)")
```

This works reasonably well for simple functions, but can get really ugly with complex functions (complex functions, casts), resulting objects are somewhat verbose and we don't get any IDE support.  Additionally DSL used, though  very simple, is not documented.

With changes propose here, above query could be rewritten as:

```python
df.select(transform("values", lambda x: x + 1))
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

- For positive cases this PR adds doctest strings covering possible usage patterns.
- For negative cases (unsupported function types) this PR adds unit tests.

### Notes

If approved, the same approach can be used in SparkR.

Closes #27406 from zero323/SPARK-30681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:59:39 +09:00
zero323 c467961e8a [SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions
### What changes were proposed in this pull request?

This PR add R API for invoking following higher functions:

- `transform` -> `array_transform` (to avoid conflict with `base::transform`).
- `exists` -> `array_exists` (to avoid conflict with `base::exists`).
- `forall` -> `array_forall` (no conflicts, renamed for consistency)
- `filter` -> `array_filter` (to avoid conflict with `stats::filter`).
- `aggregate` -> `array_aggregate` (to avoid conflict with `stats::transform`).
- `zip_with` -> `arrays_zip_with` (no conflicts, renamed for consistency)
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

Overall implementation follows the same pattern as proposed for PySpark (#27406) and reuses object supporting Scala implementation (SPARK-27297).

### Why are the changes needed?

Currently higher order functions are available only using SQL and Scala API and can use only SQL expressions:

```r
select(df, expr("transform(xs, x -> x + 1)")
```

This is error-prone, and hard to do right, when complex logic is used (`when` / `otherwise`, complex objects).

If this PR is accepted, above function could be simply rewritten as:

```r
select(df, transform("xs", function(x) x + 1))
```

### Does this PR introduce any user-facing change?

No (but new user-facing functions are added).

### How was this patch tested?

Added new unit tests.

Closes #27433 from zero323/SPARK-30682.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:58:56 +09:00
yi.wu a1d2ce90b0 [SPARK-30972][SQL] PruneHiveTablePartitions should be executed as earlyScanPushDownRules
### What changes were proposed in this pull request?

Make rule `PruneHiveTablePartitions` to execute as `earlyScanPushDownRules`.

### Why are the changes needed?

Similar to rule `PruneFileSourcePartitions`, `PruneHiveTablePartitions` should also be executed as earlyScanPushDownRules to eliminate the impact on statistic computation later.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27723 from Ngone51/early_hive_prune.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 11:50:20 +08:00
Liang-Chi Hsieh ba032acf95 [SPARK-30955][SQL] Exclude Generate output when aliasing in nested column pruning
### What changes were proposed in this pull request?

When aliasing in nested column pruning in Project on top of Generate, we should exclude Generate outputs.

### Why are the changes needed?

Right now we would prune nested columns in Project on top of Generate. It is possible that referred nested columns are from Generate's outputs, not from its child. To address that case, we should exclude Generate outputs when aliasing in nested column pruning.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27702 from viirya/fix-nested-pruning.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:29:46 +09:00
Takeshi Yamamuro 314442a3d5 [SQL][MINOR][TESTS] Remove GivenWhenThen trait from HiveComparisonTest
### What changes were proposed in this pull request?

This pr intends to remove non-used trait, `GivenWhenThen`, from `HiveComparisonTest`.

### Why are the changes needed?

For better code.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #27726 from maropu/MINOR-20200228.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-27 17:19:27 -08:00
Dongjoon Hyun 3995728c3c [SPARK-30968][BUILD] Upgrade aws-java-sdk-sts to 1.11.655
### What changes were proposed in this pull request?

This PR aims to upgrade `aws-java-sdk-sts` to `1.11.655`.

### Why are the changes needed?

[SPARK-29677](https://github.com/apache/spark/pull/26333) upgrades AWS Kinesis Client to 1.12.0 for Apache Spark 2.4.5 and 3.0.0.

Since AWS Kinesis Client 1.12.0 is using AWS SDK 1.11.665, `aws-java-sdk-sts` should be consistent with Kinesis client dependency.
- https://github.com/awslabs/amazon-kinesis-client/releases/tag/v1.12.0

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #27720 from dongjoon-hyun/SPARK-30968.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-27 17:05:56 -08:00
Tigran Saluev 6f4a2e4c99 [MINOR][ML] Fix confusing error message in VectorAssembler
### What changes were proposed in this pull request?

When VectorAssembler encounters a NULL with handleInvalid="error", it throws an exception. This exception, though, has a typo making it confusing. Yet apparently, this same exception for NaN values is fine. Fixed it to look like the right one.

### Why are the changes needed?

Encountering this error with such message was very confusing! I hope to save time of fellow engineers by improving it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It's just an error message...

Closes #27709 from Saluev/patch-1.

Authored-by: Tigran Saluev <tigran@saluev.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-27 11:05:53 -06:00
Kent Yao 2d2706cb86 [SPARK-30956][SQL][TESTS] Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
### What changes were proposed in this pull request?

In this PR, I addressed the comment from https://github.com/apache/spark/pull/27672#discussion_r383719562 to use `intercept` instead of `try-catch` block to assert  failures in the IntervalUtilsSuite

### Why are the changes needed?

improve tests
### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Nah

Closes #27700 from yaooqinn/intervaltest.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-27 23:12:35 +09:00
yi.wu 22dfd15a45 [SPARK-30937][DOC] Group Hive upgrade guides together
### What changes were proposed in this pull request?

This PR groups all hive upgrade related migration guides inside Spark 3.0 together.

Also add another behavior change of `ScriptTransform` in the new Hive section.

### Why are the changes needed?

Make the doc more clearly to user.

### Does this PR introduce any user-facing change?

No, new doc for Spark 3.0.

### How was this patch tested?

N/A.

Closes #27670 from Ngone51/hive_migration.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 21:29:42 +08:00
HyukjinKwon b2a74107d5 [SPARK-23435][INFRA][FOLLOW-UP] Remove unnecessary dependency in AppVeyor
### What changes were proposed in this pull request?
`testthat` version was pinned to `1.0.2` at f15102b170 due to compatibility issue in SparkR.
The compatibility issue is finally fixed as of 298d0a5102 and we now use testthat latest version.

Now we don't need to install `crayon', 'praise' and 'R6' as they are dependences in testthat (https://github.com/r-lib/testthat/blob/master/DESCRIPTION).

### Why are the changes needed?
To minimise build specification and prevent dependency confusion.

### Does this PR introduce any user-facing change?
No. Dev only change.

### How was this patch tested?
AppVeyor build will test it out.

Closes #27717 from HyukjinKwon/SPARK-23435-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-27 00:18:46 -08:00
Eric Wu bce8d9354c [SPARK-30765][SQL] Refine base operator abstraction code style
### What changes were proposed in this pull request?
When doing base operator abstraction work, we found there are still some code snippet is  inconsistent with other abstraction code style. This PR addressed following two code refactor cases.

**Case 1** Override keyword missed for some fields in derived classes. The compiler will not capture it if we rename some fields in the future.

**Case 2** Inconsistent abstract class field definition. The updated style will simplify derived class definition, e.g. `EvalPythonExec` `WindowExecBase`

### Why are the changes needed?
Improve the code style consistency and code quality

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27511 from Eric5553/BaseClassAbstraction.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 16:40:10 +09:00
Liang-Chi Hsieh 160c144baa [SPARK-30590][SQL] Untyped select API cannot take typed column expression that needs input type
### What changes were proposed in this pull request?

This patch proposes to throw clear analysis exception if untyped `Dataset.select` takes typed column expression that needs input type.

### Why are the changes needed?

`Dataset` provides few typed `select` helper functions to select typed column expressions. The maximum number of typed columns supported is 5. If wanting to select more than 5 typed columns, it silently calls untyped `Dataset.select` and can causes weird unresolved error, like:

```
org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS foo_agg_6#141]
+- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS e#17, _6#11 AS F#18]
 +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]

at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
 at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430)
 at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:430)
```

However, to fully disallow typed columns as input to untyped `select` API will break current usage like `count` that is a `TypedColumn` in `functions`. In order to keep compatibility, we should allow current usage of certain `TypedColumn`s as input to untyped `select` API. For the `TypedColumn`s that will cause unresolved exception, we should explicitly let users know that they are incorrectly calling untyped `select` with typed columns which need input type.

### Does this PR introduce any user-facing change?

Yes, but this PR only refines the error message.

When users call `Dataset.select` API with typed column that needs input type, an analysis exception will be thrown. Previously an unresolved error will be thrown.

### How was this patch tested?

Unit tests.

Closes #27499 from viirya/SPARK-30590.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 14:09:07 +08:00
beliefer 1515d45b8d [SPARK-27924][SQL][FOLLOW-UP] Improve ANSI SQL Boolean-Predicate
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/25074 and improves the implement.

### Why are the changes needed?
Improve code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27699 from beliefer/improve-boolean-test.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 13:42:02 +08:00
Dongjoon Hyun 2b744fe885 [SPARK-30963][INFRA] Add GitHub Action job for document generation
### What changes were proposed in this pull request?

This PR aims to add a new `GitHub Action` job for document generation.

### Why are the changes needed?

We had better test the document generation in PR Builder.
- https://lists.apache.org/thread.html/rd06a2154e853812652b8f7fa3c003746ed531b213c531517f055e1dc%40%3Cdev.spark.apache.org%3E

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action in this PR.

Closes #27715 from dongjoon-hyun/SPARK-30963.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-26 19:24:41 -08:00
beliefer 325bf56e73 [SPARK-30888][CORE][DOC] Add version information to the configuration of Network
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Network`.
2.Update the docs of `Network`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.network.crypto.saslFallback | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.network.crypto.enabled | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-6bdad48cfc34314e89599655442ff210 |  
spark.network.remoteReadNioBufferConversion | 2.4.0 | SPARK-24307 | 2c82745686f4456c4d5c84040a431dcb5b6cb60b#diff-2b643ea78c1add0381754b1f47eec132 |  
spark.network.timeout | 1.3.0 | SPARK-4688 | d3f07fd23cc26a70f44c52e24445974d4885d58a#diff-1df6b5af3d8f9f16255ff8c7a06f402f |  
spark.network.timeoutInterval | 1.3.2 | SPARK-5529 | ec196ab1c7569d7ab0a50c9d7338c2835f2c84d5#diff-47779b72f095f7e7f926898fa1a425ee |  
spark.rpc.askTimeout | 1.4.0 | SPARK-6490 | 8136810dfad12008ac300116df7bc8448740f1ae#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.rpc.connect.threads | 1.6.0 | SPARK-6028 | 084e4e126211d74a79e8dbd2d0e604dd3c650822#diff-0c89b4a60c30a7cd2224bb64d93da942 |  
spark.rpc.io.numConnectionsPerPeer | 1.6.0 | SPARK-10745 | 34a77679877bc40b58a10ec539a8da00fed7db39#diff-0c89b4a60c30a7cd2224bb64d93da942 |  
spark.rpc.io.threads | 1.6.0 | SPARK-6028 | 084e4e126211d74a79e8dbd2d0e604dd3c650822#diff-0c89b4a60c30a7cd2224bb64d93da942 |  
spark.rpc.lookupTimeout | 1.4.0 | SPARK-6490 | 8136810dfad12008ac300116df7bc8448740f1ae#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.rpc.message.maxSize | 2.0.0 | SPARK-7997 | bc1babd63da4ee56e6d371eb24805a5d714e8295#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.rpc.netty.dispatcher.numThreads | 1.6.0 | SPARK-11079 | 1797055dbf1d2fd7714d7c65c8d2efde2f15efc1#diff-05133dfc4bfdb6a27aa092d86ce24866 |  
spark.rpc.numRetries | 1.4.0 | SPARK-6490 | 8136810dfad12008ac300116df7bc8448740f1ae#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.rpc.retry.wait | 1.4.0 | SPARK-6490 | 8136810dfad12008ac300116df7bc8448740f1ae#diff-529fc5c06b9731c1fbda6f3db60b16aa |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27674 from beliefer/add-version-to-network-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 11:05:11 +09:00
HyukjinKwon cd3ef2249f [SPARK-30906][SQL][TESTS][FOLLOW-UP] Set the configuration against TestHive explicitly in HiveSerDeSuite
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27659 (see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/253/), the tests below fail consistently, specifically in one job https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/ in Jenkins

```
org.apache.spark.sql.hive.execution.HiveSerDeSuite.Test the default fileformat for Hive-serde tables
```

The profile is same as PR builder but seems it fails specifically in this machine.

Several configurations used in `HiveSerDeSuite` are not being set presumably due to the inconsistency between `SQLConf.get` and the active Spark session described in the https://github.com/apache/spark/pull/27387, and as a side effect of the cloned session at https://github.com/apache/spark/pull/27659.

This PR proposes to explicitly set the configuration against `TestHive` by using `withExistingConf` at `withSQLConf`

### Why are the changes needed?

To make `spark-master-test-sbt-hadoop-2.7-hive-2.3` job pass.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Cannot reproduce in my local. Presumably it cannot be reproduced in the PR builder. We should see if the tests pass at `spark-master-test-sbt-hadoop-2.7-hive-2.3` job after this PR is merged.

Closes #27705 from HyukjinKwon/SPARK-30906.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-26 18:01:26 -08:00
beliefer 825d3dc11b [SPARK-30841][SQL][DOC] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
Add version information to the configuration of Spark SQL.
Note: Because SQLConf has a lot of configuration items, I split the items into two PR. Another PR will follows this PR.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.analyzer.maxIterations | 3.0.0 | SPARK-30138 | c2f29d5ea58eb4565cc5602937d6d0bb75558513#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.excludedRules | 2.4.0 | SPARK-24802 | 434319e73f8cb6e080671bdde42a72228bd814ef#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.maxIterations | 2.0.0 | SPARK-14677 | f4be0946af219379fb2476e6f80b2e50463adeb2#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetConversionThreshold | 2.0.0 | SPARK-14796 | 3647120a5a879edf3a96a5fd68fb7aa849ad57ef#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetSwitchThreshold | 3.0.0 | SPARK-26205 | 0c23a39384b7ae5fb4aeb4f7f6fe72007b84bbd2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.planChangeLog.level | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.rules | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.batches | 3.0.0 | SPARK-27088 | 074533334d01afdd7862a1ac6c5a7a672bcce3f8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.enabled | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.useStats | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly | 3.0.0 | SPARK-30528 | 59a13c9b7bc3b3aa5b5bc30a60344f849c0f8012#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.partitionPruning | 1.2.0 | SPARK-2961 | 248067adbe90f93c7d5e23aa61b3072dfdf48a8a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryTableScanStatistics.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.enableVectorizedReader | 2.3.1 | SPARK-23312 | e5e9f9a430c827669ecfe9d5c13cc555fc89c980#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.columnVector.offheap.enabled | 2.3.0 | SPARK-20101 | 572af5027e45ca96e0d283a8bf7c84dcf476f9bc#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.join.preferSortMergeJoin | 2.0.0 | SPARK-13977 | 9c23c818ca0175c8f2a4a66eac261ec251d27c97#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.sort.enableRadixSort | 2.0.0 | SPARK-14724 | e2b5647ab92eb478b3f7b36a0ce6faf83e24c0e5#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.limit.scaleUpFactor | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.hive.advancedPartitionPredicatePushdown.enabled | 2.3.0 | SPARK-20331 | d8cada8d1d3fce979a4bc1f9879593206722a3b9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.enabled | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |
spark.sql.adaptive.forceApply | 3.0.0 | SPARK-30719 | b29cb1a82b1a1facf1dd040025db93d998dad4cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.reducePostShufflePartitions | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.minNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.localShuffleReader.enabled | 3.0.0 | SPARK-29893 | 6e581cf164c3a2930966b270ac1406dc1195c942#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.enabled | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.skewedPartitionFactor | 3.0.0 | SPARK-30812 | 5b36cdbbfef147e93b35eaa4f8e0bea9690b6d06#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin | 3.0.0 | SPARK-9853 and SPARK-29002 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 and b2f06608b785f577999318c00f2c315f39d90889#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.subexpressionElimination.enabled | 1.6.0 | SPARK-10371 | f38509a763816f43a224653fe65e4645894c9fc4#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.caseSensitive | 1.4.0 | SPARK-4699 | 21bd7222e55b9cf684c072141998a0623a69f514#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.constraintPropagation.enabled | 2.2.0 | SPARK-19846 | e011004bedca47be998a0c14fe22a6f9bb5090cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.escapedStringLiterals | 2.2.1 | SPARK-20399 | 3d1908fd58fd9b1970cbffebdb731bfe4c776ad9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.fileCompressionFactor | 2.3.1 | SPARK-22790 | 0fc5533e53ad03eb67590ddd231f40c2713150c3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.respectSummaryFiles | 1.5.0 | SPARK-8838 | 6175d6cfe795fbd88e3ee713fac375038a3993a8#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96TimestampConversion | 2.3.0 | SPARK-12297 | acf7ef3154e094875fa89f30a78ab111b267db91#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.outputTimestampType | 2.3.0 | SPARK-10365 | 21a7bfd5c324e6c82152229f1394f26afeae771c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown.date | 2.4.0 | SPARK-23727 | b02e76cbffe9e589b7a4e60f91250ca12a4420b2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.filterPushdown.timestamp | 2.4.0 | SPARK-24718 | 43e4e851b642bbee535d22e1b9e72ec6b99f6ed4#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.decimal | 2.4.0 | SPARK-24549 | 9549a2814951f9ba969955d78ac4bd2240f85989#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.string.startsWith | 2.4.0 | SPARK-24638 | 03545ce6de08bd0ad685c5f59b73bc22dfc40887#diff-9a6b543db706f1a90f790783d6930a13 | 
spark.sql.parquet.pushdown.inFilterThreshold | 2.4.0 | SPARK-17091 | e1de34113e057707dfc5ff54a8109b3ec7c16dfb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.output.committer.class | 1.5.0 | SPARK-8139 | 111d6b9b8a584b962b6ae80c7aa8c45845ce0099#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.enableVectorizedReader | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.parquet.recordLevelFilter.enabled | 2.3.0 | SPARK-17310 | 673c67046598d33b9ecf864024ca7a937c1998d6#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27691 from beliefer/add-version-to-sql-config-part-one.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:58:44 +09:00
beliefer c2857501d5 [SPARK-30909][CORE][DOC] Add version information to the configuration of Python
### What changes were proposed in this pull request?
1.Add version information to the configuration of `Python`.
2.Update the docs of `Python`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.python.worker.reuse | 1.2.0 | SPARK-3030 | 2aea0da84c58a179917311290083456dfa043db7#diff-0a67bc4d171abe4df8eb305b0f4123a2 |  
spark.python.task.killTimeout | 2.2.2 | SPARK-22535 | be68f86e11d64209d9e325ce807025318f383bea#diff-0a67bc4d171abe4df8eb305b0f4123a2 |  
spark.python.use.daemon | 2.3.0 | SPARK-22554 | 57c5514de9dba1c14e296f85fb13fef23ce8c73f#diff-9008ad45db34a7eee2e265a50626841b |  
spark.python.daemon.module | 2.4.0 | SPARK-22959 | afae8f2bc82597593595af68d1aa2d802210ea8b#diff-9008ad45db34a7eee2e265a50626841b |  
spark.python.worker.module | 2.4.0 | SPARK-22959 | afae8f2bc82597593595af68d1aa2d802210ea8b#diff-9008ad45db34a7eee2e265a50626841b |  
spark.executor.pyspark.memory | 2.4.0 | SPARK-25004 | 7ad18ee9f26e75dbe038c6034700f9cd4c0e2baa#diff-6bdad48cfc34314e89599655442ff210 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27704 from beliefer/add-version-to-python-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:57:34 +09:00
beliefer 776e21af40 [SPARK-30910][CORE][DOC] Add version information to the configuration of R
### What changes were proposed in this pull request?
1.Add version information to the configuration of `R`.
2.Update the docs of `R`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.r.backendConnectionTimeout | 2.1.0 | SPARK-17919 | 2881a2d1d1a650a91df2c6a01275eba14a43b42a#diff-025470e1b7094d7cf4a78ea353fb3981 |  
spark.r.numRBackendThreads | 1.4.0 | SPARK-8282 | 28e8a6ea65fd08ab9cefc4d179d5c66ffefd3eb4#diff-697f7f2fc89808e0113efc71ed235db2 |  
spark.r.heartBeatInterval | 2.1.0 | SPARK-17919 | 2881a2d1d1a650a91df2c6a01275eba14a43b42a#diff-fe903bf14db371aa320b7cc516f2463c |  
spark.sparkr.r.command | 1.5.3 | SPARK-10971 | 9695f452e86a88bef3bcbd1f3c0b00ad9e9ac6e1#diff-025470e1b7094d7cf4a78ea353fb3981 |  
spark.r.command | 1.5.3 | SPARK-10971 | 9695f452e86a88bef3bcbd1f3c0b00ad9e9ac6e1#diff-025470e1b7094d7cf4a78ea353fb3981 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27708 from beliefer/add-version-to-R-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:56:38 +09:00
Huaxin Gao 2c9db5608d [SPARK-30928][ML] Remove unnecessary MiMa excludes
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-30928
remove unnecessary MiMa excludes

### Why are the changes needed?

When auditing binary incompatible changes for 3.0, I found several MiMa excludes are not necessary, so remove these.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run dev/mima to check

Closes #27696 from huaxingao/spark-mima.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-26 19:45:15 -06:00