### What changes were proposed in this pull request?
This PR proposes to enable the JSON datasources to write non-ASCII characters as codepoints.
To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for the JSON datasources.
### Why are the changes needed?
The JSON specification allows codepoints as literals, but Spark SQL's JSON datasources provide no way to write them.
It would be great to be able to write non-ASCII characters as codepoints, which is a platform-neutral representation.
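For illustration, Python's standard `json` module (not Spark's writer) performs the same codepoint escaping, toggled there by the `ensure_ascii` parameter:

```python
import json

# ensure_ascii=True (the default) escapes non-ASCII characters as \uXXXX codepoints,
# analogous to what writeNonAsciiCharacterAsCodePoint enables for Spark's JSON writer.
escaped = json.dumps({"name": "café"})
assert escaped == '{"name": "caf\\u00e9"}'

# With ensure_ascii=False the raw UTF-8 character is written instead
# (Spark's existing behavior).
raw = json.dumps({"name": "café"}, ensure_ascii=False)
assert raw == '{"name": "café"}'
```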
### Does this PR introduce _any_ user-facing change?
Yes. Users can write non-ASCII characters as codepoints with the JSON datasources.
### How was this patch tested?
New test.
Closes #32147 from sarutak/json-unicode-write.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR upgrades `GenJavadoc` to `0.17`.
### Why are the changes needed?
This version seems to include a fix for an issue which can happen with Scala 2.13.5.
https://github.com/lightbend/genjavadoc/releases/tag/v0.17
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed the build succeeds with the following commands.
```
# For Scala 2.12
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc
# For Scala 2.13
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```
Closes #32392 from sarutak/upgrade-genjavadoc-0.17.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
With this PR Spark avoids creating multiple monitor threads for the same worker and same task context.
### Why are the changes needed?
Without this change, unnecessary threads are created. This can even cause job failures, for example when a coalesce (without shuffle) goes from a high partition number to a very low one. The following exception comes from exactly such a run:
```
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.210 executor driver): java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2210)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2210)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1083)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1083)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1083)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2449)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2391)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2380)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:872)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2220)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2260)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually, using the following Python script (`reproduce-SPARK-35009.py`):
```
import pyspark
conf = pyspark.SparkConf().setMaster("local[*]").setAppName("Test1")
sc = pyspark.SparkContext.getOrCreate(conf)
rows = 70000
data = list(range(rows))
rdd = sc.parallelize(data, rows)
assert rdd.getNumPartitions() == rows
rdd0 = rdd.filter(lambda x: False)
data = rdd0.coalesce(1).collect()
assert data == []
```
Spark submit:
```
$ ./bin/spark-submit reproduce-SPARK-35009.py
```
#### With this change
Checking the number of monitor threads with jcmd:
```
$ jcmd
85273 sun.tools.jcmd.JCmd
85227 org.apache.spark.deploy.SparkSubmit reproduce-SPARK-35009.py
41020 scala.tools.nsc.MainGenericRunner
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
...
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
```
<img width="859" alt="Screenshot 2021-04-14 at 16 06 51" src="https://user-images.githubusercontent.com/2017933/114731755-4969b980-9d42-11eb-8ec5-f60b217bdd96.png">
#### Without this change
```
...
$ jcmd 90052 Thread.print | grep -c "Monitor for python"
5645
...
```
<img width="856" alt="Screenshot 2021-04-14 at 16 30 18" src="https://user-images.githubusercontent.com/2017933/114731724-4373d880-9d42-11eb-9f9b-d976bf2530e2.png">
Closes #32169 from attilapiros/SPARK-35009.
Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.1.
### Why are the changes needed?
https://github.com/sbt/sbt/releases/tag/v1.5.1
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Pass the SBT CIs (Build/Test/Docs/Plugins).
Closes #32382 from lipzhu/SPARK-35254.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache commons-lang3 to 3.12.0.
### Why are the changes needed?
This version will bring the latest bug fixes as follows:
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Actions CI.
Closes #32393 from LuciferYang/lang3-to-312.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`failureMessage` is already formatted, but `replaceAll("\n", " ")` destroyed the format. This PR fixes it.
### Why are the changes needed?
The formatted error message is easier to read and debug.
### Does this PR introduce _any_ user-facing change?
Yes, users see the clear error message in the application log.
(Note: I changed the test a little bit to make it throw an exception intentionally. The test itself is good.)
Before:
![2141619490903_ pic_hd](https://user-images.githubusercontent.com/16397174/116177970-5a092f00-a747-11eb-9a0f-017391e80c8b.jpg)
After:
![2151619490955_ pic_hd](https://user-images.githubusercontent.com/16397174/116177981-5ecde300-a747-11eb-90ef-fd16e906beeb.jpg)
### How was this patch tested?
Manually tested.
Closes #32356 from Ngone51/format-stage-error-message.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
### What changes were proposed in this pull request?
This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive.
### Why are the changes needed?
To make those commands more useful.
### Does this PR introduce _any_ user-facing change?
Yes. In the current implementation, those commands can take a path containing whitespace without enclosing it in `'` or `"`, but after this change users need to enclose such paths.
I've noted this incompatibility in the migration guide.
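The new quoting behavior is analogous to shell-style tokenization; Python's standard `shlex` module (used here purely for illustration, not Spark's parser) shows why unquoted paths with whitespace become ambiguous once multiple path arguments are allowed:

```python
import shlex

# Unquoted, a path containing whitespace now splits into separate arguments...
assert shlex.split("/tmp/my file.txt /tmp/b.txt") == ["/tmp/my", "file.txt", "/tmp/b.txt"]

# ...so such a path must be enclosed in quotes to stay a single argument.
assert shlex.split("'/tmp/my file.txt' /tmp/b.txt") == ["/tmp/my file.txt", "/tmp/b.txt"]
```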
### How was this patch tested?
New tests.
Closes #32205 from sarutak/add-multiple-files.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows to reflect the change of `krb5.conf`.
### Why are the changes needed?
In the current master, JDBC datasources can't accept `refreshKrb5Config`, which is defined in `Krb5LoginModule`.
So even if we change `krb5.conf` after establishing a connection, the change will not be reflected.
A similar issue happens when we run multiple `*KrbIntegrationSuite`s at the same time.
`MiniKDC` starts and stops for every KrbIntegrationSuite, and a different port number is recorded in `krb5.conf` each time.
Because `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, all KrbIntegrationSuites except the first running one see the wrong port, so those suites fail.
You can easily confirm with the following command.
```
build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```
### Does this PR introduce _any_ user-facing change?
Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration.
### How was this patch tested?
New test.
Closes #32344 from sarutak/kerberos-refresh-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Explicitly declare `DecimalType(20, 0)` for Parquet UINT_64, avoiding the use of `DecimalType.LongDecimal`, which only happens to have 20 as its precision.
https://github.com/apache/spark/pull/31960#discussion_r622691560
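A quick sanity check of why precision 20 is the right value to declare explicitly:

```python
# The largest unsigned 64-bit value needs exactly 20 decimal digits,
# hence DecimalType(20, 0) for Parquet UINT_64.
UINT64_MAX = 2**64 - 1
assert UINT64_MAX == 18446744073709551615
assert len(str(UINT64_MAX)) == 20

# A signed 64-bit Long needs only 19 digits; LongDecimal's precision of 20
# covering UINT_64 was a coincidence, not a guarantee.
assert len(str(2**63 - 1)) == 19
```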
### Why are the changes needed?
fix ambiguity
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Not needed; passing the current CI is sufficient.
Closes #32390 from yaooqinn/SPARK-34786-F.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`WritablePartitionedIterator` is defined in `WritablePartitionedPairCollection.scala` and has two implementations, but the code of those two implementations is duplicated.
The main change of this PR is to turn `WritablePartitionedIterator` from a trait into a concrete class, because only one implementation is needed now.
### Why are the changes needed?
Cleanup duplicate code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Actions CI.
Closes #32232 from LuciferYang/writable-partitioned-iterator.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
### What changes were proposed in this pull request?
Make sure we re-throw an exception that is not null.
### Why are the changes needed?
to be super safe
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #32387 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-Authored-By: Chao Sun <sunchao@apple.com>
Co-Authored-By: Ryan Blue <rblue@netflix.com>
### What changes were proposed in this pull request?
This implements function resolution and evaluation for functions registered through V2 FunctionCatalog [SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658). In particular:
- Added documentation for how to define the "magic method" in `ScalarFunction`.
- Added a new expression `ApplyFunctionExpression` which evaluates input by delegating to `ScalarFunction.produceResult` method.
- Added a new expression `V2Aggregator`, which is a type of `TypedImperativeAggregate`. It's a wrapper of the V2 `AggregateFunction` and mostly delegates methods to the implementation of the latter. It also uses plain Java serde for intermediate state.
- Added function resolution logic for `ScalarFunction` and `AggregateFunction` in `Analyzer`.
+ For `ScalarFunction`, this checks whether the magic method is implemented through Java reflection, and creates an `Invoke` expression if so. Otherwise, it checks whether the default `produceResult` is overridden. If so, it creates an `ApplyFunctionExpression` which evaluates through `InternalRow`. Otherwise, an analysis exception is thrown.
+ For `AggregateFunction`, this checks whether the `update` method is overridden. If so, it converts it to `V2Aggregator`. Otherwise, an analysis exception is thrown, similar to the case of `ScalarFunction`.
- Extended existing `InMemoryTableCatalog` to add the function catalog capability. Also renamed it to `InMemoryCatalog` since it no longer only covers tables.
**Note**: this currently can successfully detect whether a subclass overrides the default `produceResult` or `update` method from the parent interface **only for Java implementations**. It seems in Scala it's hard to differentiate whether a subclass overrides a default method from its parent interface. In this case, it will be a runtime error instead of analysis error.
A few TODOs:
- Extend `V2SessionCatalog` with the function catalog. This seems a little tricky since APIs such as the V2 `FunctionCatalog`'s `loadFunction` differ from the V1 `SessionCatalog`'s `lookupFunction`.
- Add magic method for `AggregateFunction`.
- Type coercion when looking up functions
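The override detection described above can be sketched in plain Python (an illustration of the idea, not Spark's reflection-based Java code): a subclass overrides the default method exactly when its method attribute differs from the base interface's.

```python
class ScalarFunction:
    # Default implementation, analogous to the interface's default produceResult.
    def produce_result(self, row):
        raise NotImplementedError("produceResult is not implemented")

class AddFunction(ScalarFunction):
    def produce_result(self, row):        # overrides the default
        return row[0] + row[1]

class MagicOnlyFunction(ScalarFunction):
    pass                                  # relies on a "magic method" instead

def overrides_produce_result(cls):
    # Overridden exactly when the attribute is not the base interface's.
    return cls.produce_result is not ScalarFunction.produce_result

assert overrides_produce_result(AddFunction)
assert not overrides_produce_result(MagicOnlyFunction)
```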
### Why are the changes needed?
As V2 FunctionCatalog APIs are finalized, we should integrate it with function resolution and evaluation process so that they are actually useful.
### Does this PR introduce _any_ user-facing change?
Yes, now a function exposed through V2 FunctionCatalog can be analyzed and evaluated.
### How was this patch tested?
Added new unit tests.
Closes #32082 from sunchao/resolve-func-v2.
Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Reorder `DemoteBroadcastHashJoin` and `EliminateUnnecessaryJoin`.
### Why are the changes needed?
Skip the unnecessary check in `DemoteBroadcastHashJoin` if `EliminateUnnecessaryJoin` takes effect.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No results are affected.
Closes #32380 from ulysses-you/SPARK-34781-FOLLOWUP.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `ShuffledHashJoin` pattern check in `OptimizeSkewedJoin` so that we can optimize it.
### Why are the changes needed?
Currently, we already support all join types through hints, which makes it easy to choose the join implementation.
We would choose `ShuffledHashJoin` if one table is not big but is over the broadcast threshold. It's better if `OptimizeSkewedJoin` can optimize it as well.
### Does this PR introduce _any_ user-facing change?
Probably yes; the execution plan in AQE mode may change.
### How was this patch tested?
Improved existing tests in `AdaptiveQueryExecSuite`.
Closes #32328 from ulysses-you/SPARK-35214.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add documentation about `TRANSFORM` and related functions.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed; documentation-only change.
Closes #32257 from AngersZhuuuu/SPARK-33976-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR let JDBC clients identify ANSI interval columns properly.
### Why are the changes needed?
This PR is similar to https://github.com/apache/spark/pull/29539.
JDBC users can query interval values through the Thrift server and create views with ANSI interval columns, e.g.
`CREATE global temp view view1 as select interval '1-1' year to month as I;`
but when they want to get the details of the columns of `view1`, they will fail with `Unrecognized type name: YEAR-MONTH INTERVAL`:
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: YEAR-MONTH INTERVAL
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:190)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:206)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:198)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7(SparkGetColumnsOperation.scala:109)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7$adapted(SparkGetColumnsOperation.scala:109)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:109)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:107)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:107)
... 34 more
```
### Does this PR introduce _any_ user-facing change?
Yes. Hive JDBC can now recognize ANSI intervals.
### How was this patch tested?
Jenkins test.
Closes #32345 from beliefer/SPARK-35085.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
As we now support year-month and day-time intervals, add tests for the actual size of the year-month and day-time interval types.
### Why are the changes needed?
Test-only change.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`./dev/scalastyle` and the tests in `ColumnTypeSuite`.
Closes #32366 from Peng-Lei/SPARK-34878.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The UnsupportedOperationChecker shouldn't allow streaming-batch intersects. As described in the ticket, they can't actually be planned correctly, and even simple cases like the below will fail:
```
test("intersect") {
val input = MemoryStream[Long]
val df = input.toDS().intersect(spark.range(10).as[Long])
testStream(df) (
AddData(input, 1L),
CheckAnswer(1)
)
}
```
### Why are the changes needed?
Users will be confused by the cryptic errors produced from trying to run an invalid query plan.
### Does this PR introduce _any_ user-facing change?
Some queries which previously failed with a poor error will now fail with a better one.
### How was this patch tested?
modified unit test
Closes #32371 from jose-torres/ossthing.
Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR updates the interpreted code path of invoke expressions to unwrap the `InvocationTargetException`.
### Why are the changes needed?
Make interpreted and codegen path consistent for invoke expressions.
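The unwrapping pattern can be sketched in Python (hypothetical names; Java's real `InvocationTargetException` exposes the original error via `getCause`):

```python
class InvocationTargetException(Exception):
    """Stand-in for the JVM's reflective-call wrapper (illustration only)."""

def reflective_invoke(fn, *args):
    # Reflection wraps any error thrown by the target method.
    try:
        return fn(*args)
    except Exception as exc:
        raise InvocationTargetException("wrapped") from exc

def interpreted_invoke(fn, *args):
    # The fix: unwrap and re-throw the original cause, matching the codegen path.
    try:
        return reflective_invoke(fn, *args)
    except InvocationTargetException as exc:
        raise exc.__cause__

unwrapped = False
try:
    interpreted_invoke(lambda: 1 // 0)
except ZeroDivisionError:
    unwrapped = True       # callers see the real error, not the wrapper
assert unwrapped
```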
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new UT
Closes #32370 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to let the `CREATE FUNCTION USING` syntax take archives as resources.
### Why are the changes needed?
It would be useful.
The `CREATE FUNCTION USING` syntax doesn't support archives as resources because archives were not supported in Spark SQL.
Now that Spark SQL supports archives, I think we can support them for this syntax too.
### Does this PR introduce _any_ user-facing change?
Yes. Users can specify archives for `CREATE FUNCTION USING` syntax.
### How was this patch tested?
New test.
Closes #32359 from sarutak/load-function-using-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds a note for aarch64 users to install pyarrow>=4.0.0.
### Why are the changes needed?
pyarrow aarch64 support was [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on April 27, 2021.
See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).
### Does this PR introduce _any_ user-facing change?
Yes, this doc can help users install Arrow on aarch64.
### How was this patch tested?
Doc tests passed.
Closes #32363 from Yikun/SPARK-34979.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add JindoFS SDK documents link in the cloud integration section of Spark's official document.
### Why are the changes needed?
If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested the URL manually.
Closes #32360 from adrian-wang/jindodoc.
Authored-by: Daoyuan Wang <daoyuan.wdy@alibaba-inc.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector
In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
The documentation is ambiguous & can lead to the false conclusion that instances of
pyspark.mllib.linalg.DenseVector will be returned.
Conversion from the mllib versions to the ml versions can easily be achieved with the
`MLUtils.convertVectorColumnsToML` helper.
### What changes were proposed in this pull request?
Make the documentation more explicit.
### Why are the changes needed?
The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No tests were run as only the documentation was changed.
Closes #32255 from jlafaye/master.
Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, we add extract/date_part support for ANSI intervals.
`extract` is an ANSI expression and `date_part` is non-ANSI, but exists as an equivalent of `extract`.
#### expression
```
<extract expression> ::=
EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
```
#### `<extract field>` for an interval source
```
<primary datetime field> ::=
<non-second primary datetime field>
| SECOND
<non-second primary datetime field> ::=
YEAR
| MONTH
| DAY
| HOUR
| MINUTE
```
#### dataType
```
If <extract field> is a <primary datetime field> that does not specify SECOND or <extract field> is not a <primary datetime field>, then the declared type of the result is an implementation-defined exact numeric type with scale 0 (zero)
Otherwise, the declared type of the result is an implementation-defined exact numeric type with scale not less than the specified or implied <time fractional seconds precision> or <interval fractional seconds precision>, as appropriate, of the SECOND <primary datetime field> of the <extract source>.
```
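These rules can be illustrated with a plain-Python sketch over a day-time interval (using `datetime.timedelta` as a stand-in, not Spark's implementation): every field except SECOND yields a scale-0 integer, while SECOND keeps the fractional part.

```python
from datetime import timedelta

def extract(field: str, itv: timedelta):
    """Sketch of EXTRACT on a day-time interval (illustration only)."""
    micros = itv // timedelta(microseconds=1)   # total microseconds as an int
    if field == "DAY":
        return micros // 86_400_000_000
    if field == "HOUR":
        return micros // 3_600_000_000 % 24
    if field == "MINUTE":
        return micros // 60_000_000 % 60
    if field == "SECOND":                       # SECOND keeps the fractional part
        return micros % 60_000_000 / 1_000_000
    raise ValueError(f"unsupported field: {field}")

itv = timedelta(days=1, hours=2, minutes=3, seconds=4, microseconds=500_000)
assert extract("DAY", itv) == 1
assert extract("HOUR", itv) == 2
assert extract("MINUTE", itv) == 3
assert extract("SECOND", itv) == 4.5
```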
### Why are the changes needed?
Subtask of ANSI Intervals Support
### Does this PR introduce _any_ user-facing change?
Yes
1. extract/date_part support ANSI intervals
2. for non-ANSI intervals, the return type is changed from long to byte when extracting hours
### How was this patch tested?
new added tests
Closes #32351 from yaooqinn/SPARK-35091.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Create empty partition for custom shuffle reader if input RDD is empty.
### Why are the changes needed?
If an input RDD partition is empty, the map output statistics will be null. And if all of a shuffle stage's input RDD partitions are empty, we skip it and lose the chance to coalesce partitions.
We can simply create an empty partition for these custom shuffle readers to reduce the partition number.
### Does this PR introduce _any_ user-facing change?
Yes, the shuffle partition might be changed in AQE.
### How was this patch tested?
add new test.
Closes #32362 from ulysses-you/SPARK-35239.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/32229 support ANSI SQL intervals by the aggregate function `avg`.
But it does not handle input with zero rows, which leads to:
```
Caused by: java.lang.ArithmeticException: / by zero
at com.google.common.math.LongMath.divide(LongMath.java:367)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1864)
at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2248)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
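A minimal sketch of the zero-row guard, assuming the average is kept as a running (sum, count) pair (names are illustrative, not the PR's exact code):

```python
def interval_avg(total_micros, count):
    # Sketch of the fix: guard against zero rows before dividing, returning
    # None (SQL NULL) instead of raising an ArithmeticException.
    if count == 0:
        return None
    return total_micros // count
```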
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
No, this only affects a new feature.
### How was this patch tested?
New tests.
Closes#32358 from beliefer/SPARK-34837-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Before this patch
```
scala> Seq(java.time.Period.ofMonths(Int.MinValue)).toDF("i").select($"i" / -1).show(false)
+-------------------------------------+
|(i / -1) |
+-------------------------------------+
|INTERVAL '-178956970-8' YEAR TO MONTH|
+-------------------------------------+
scala> Seq(java.time.Duration.of(Long.MinValue, java.time.temporal.ChronoUnit.MICROS)).toDF("i").select($"i" / -1).show(false)
+---------------------------------------------------+
|(i / -1) |
+---------------------------------------------------+
|INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND|
+---------------------------------------------------+
```
Division of the minimum ANSI interval value by -1 produces a wrong result; this PR fixes it.
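The underlying issue is ordinary two's-complement overflow: `Long.MinValue / -1` does not fit in a signed 64-bit long, so it wraps back to `Long.MinValue`. A sketch of a checked, Java-style truncating division (illustrative, not the PR's exact code):

```python
INT64_MIN = -2**63

def checked_div(micros, divisor):
    # Sketch of the overflow check: -2^63 / -1 = 2^63 exceeds the 64-bit range,
    # so the division must fail instead of silently wrapping.
    if micros == INT64_MIN and divisor == -1:
        raise OverflowError("interval division overflows a 64-bit long")
    q = abs(micros) // abs(divisor)  # truncate toward zero, like Java's '/'
    return q if (micros >= 0) == (divisor >= 0) else -q
```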
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32314 from AngersZhuuuu/SPARK-35169.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`AggregateBenchmark` is only testing the performance for vectorized fast hash map, but not row-based hash map (which is used by default). We should add the row-based hash map into the benchmark.
java 8 benchmark run - https://github.com/c21/spark/actions/runs/787731549
java 11 benchmark run - https://github.com/c21/spark/actions/runs/787742858
### Why are the changes needed?
To establish and track a baseline benchmark of the different fast hash maps used in hash aggregate.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test, as this only touches benchmark code.
Closes#32357 from c21/agg-benchmark.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support `YearMonthIntervalType` and `DayTimeIntervalType` in `ArrowColumnVector`.
### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35139
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. By checking coding style via:
```
$ ./dev/scalastyle
$ ./dev/lint-java
```
2. By running the test suite `ArrowWriterSuite`.
Closes#32340 from Peng-Lei/SPARK-35139.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a re-proposal of https://github.com/apache/spark/pull/23163. Currently spark always requires a [local sort](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L188) before writing to output table with dynamic partition/bucket columns. The sort can be unnecessary if cardinality of partition/bucket values is small, and can be avoided by keeping multiple output writers concurrently.
This PR introduces a config `spark.sql.maxConcurrentOutputFileWriters` (which disables this feature by default), where user can tune the maximal number of concurrent writers. The config is needed here as we cannot keep arbitrary number of writers in task memory which can cause OOM (especially for Parquet/ORC vectorization writer).
The feature first uses concurrent writers to write rows. If the number of writers exceeds the limit specified by the above config, it sorts the rest of the rows and writes them one by one (see `DynamicPartitionDataConcurrentWriter.writeWithIterator()`).
In addition, the interface `WriteTaskStatsTracker` and its implementation `BasicWriteTaskStatsTracker` are also changed, because previously they relied on the assumption that only one writer is active when writing dynamic partitions and bucketed tables.
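The write strategy above can be sketched roughly as follows (illustrative names; in-memory lists stand in for file writers):

```python
def write_dynamic_partitions(rows, key_of, max_writers):
    # Sketch: keep one writer per partition key until the limit is hit, then
    # fall back to sorting the remaining rows by key and writing them one
    # writer at a time.
    writers = {}   # partition key -> rows written (stand-in for a file writer)
    pending = []
    for row in rows:
        k = key_of(row)
        if not pending and (k in writers or len(writers) < max_writers):
            writers.setdefault(k, []).append(row)
        else:
            pending.append(row)  # over the limit: defer to sort-based writing
    for row in sorted(pending, key=key_of):  # one active writer at a time
        writers.setdefault(key_of(row), []).append(row)
    return writers
```

The config is needed because an unbounded number of concurrent writers held in task memory could cause OOM, especially for Parquet/ORC vectorized writers.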
### Why are the changes needed?
Avoids the sort before writing output for dynamically partitioned queries and bucketed tables,
which helps improve CPU and IO performance for these queries.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `DataFrameReaderWriterSuite.scala`.
Closes#32198 from c21/writer.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The EXPLAIN command emits an empty line if the analyzed plan has no output. For example,
`sql("CREATE VIEW test AS SELECT 1").explain(true)` produces:
```
== Parsed Logical Plan ==
'CreateViewStatement [test], SELECT 1, false, false, PersistedView
+- 'Project [unresolvedalias(1, None)]
+- OneRowRelation
== Analyzed Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
+- Project [1 AS 1#7]
+- OneRowRelation
== Optimized Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
+- Project [1 AS 1#7]
+- OneRowRelation
== Physical Plan ==
Execute CreateViewCommand
+- CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
+- Project [1 AS 1#7]
+- OneRowRelation
```
### Why are the changes needed?
To handle empty output of analyzed plan and remove the unneeded empty line.
### Does this PR introduce _any_ user-facing change?
Yes, now the EXPLAIN command for the analyzed plan produces the following without the empty line:
```
== Analyzed Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
+- Project [1 AS 1#7]
+- OneRowRelation
```
### How was this patch tested?
Added a test.
Closes#32342 from imback82/analyzed_plan_blank_line.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change is to use repos.spark-packages.org instead of Bintray as the repository service for spark-packages.
### Why are the changes needed?
The change is needed because Bintray will no longer be available from May 1st.
### Does this PR introduce _any_ user-facing change?
This should be transparent for users who use SparkSubmit.
### How was this patch tested?
Tested running spark-shell with --packages manually.
Closes#32346 from bozhang2820/replace-bintray.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is another attempt to fix the flaky test "query without test harness" on ContinuousSuite.
`query without test harness` is flaky because it starts a continuous query with two partitions but assumes they will run at the same speed.
In this test, 0 and 2 will be written to partition 0, 1 and 3 will be written to partition 1. It assumes when we see 3, 2 should be written to the memory sink. But this is not guaranteed. We can add `if (currentValue == 2) Thread.sleep(5000)` at this line b2a2b5d820/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala (L135) to reproduce the failure: `Result set Set([0], [1], [3]) are not a superset of Set(0, 1, 2, 3)!`
The fix is changing `waitForRateSourceCommittedValue` to wait until all partitions reach the desired values before stopping the query.
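The fixed waiting logic can be sketched like this (hypothetical helper, not the suite's actual code):

```python
def wait_for_committed_value(read_committed, num_partitions, target):
    # Sketch of the fix: block until *every* partition has committed at least
    # `target`, rather than stopping as soon as any single partition reaches
    # it - partitions are not guaranteed to run at the same speed.
    while True:
        committed = [read_committed(p) for p in range(num_partitions)]
        if all(v >= target for v in committed):
            return committed
```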
### Why are the changes needed?
Fix a flaky test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests. Manually verify the reproduction I mentioned above doesn't fail after this change.
Closes#32316 from zsxwing/SPARK-28247-fix.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/types`.
### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32244 from beliefer/SPARK-35060.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Hive's documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform describes many usages of TRANSFORM with CLUSTER BY/ORDER BY; this PR adds tests for these cases.
### Why are the changes needed?
Add UT
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32333 from AngersZhuuuu/SPARK-33985.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes an improvement to nested column pruning when the pruning target is a generator's output. Previously we disallowed such cases. This patch allows pruning when only a single nested column is accessed after `Generate`.
E.g., `df.select(explode($"items").as('item)).select($"item.itemId")`. As we only need `itemId` from `item`, we can prune other fields out and only keep `itemId`.
In this patch, we only address explode-like generators. We will address other generators in followups.
### Why are the changes needed?
This helps to extend the availability of nested column pruning.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#31966 from viirya/SPARK-34638.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Add a note to the migration guide that DayTimeIntervalType/YearMonthIntervalType are shown differently between Hive SerDe and row format delimited.
### Why are the changes needed?
Add note
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed.
Closes#32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This patch moves the DS v2 custom metric classes to the `org.apache.spark.sql.connector.metric` package, moving `CustomAvgMetric` and `CustomSumMetric` there and making them public Java abstract classes.
### Why are the changes needed?
`CustomAvgMetric` and `CustomSumMetric` should be public APIs for developers to extend. As there are a few metric classes, we should put them together in one package.
### Does this PR introduce _any_ user-facing change?
No, dev only and they are not released yet.
### How was this patch tested?
Unit tests.
Closes#32348 from viirya/move-custom-metric-classes.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add `IssueNavigationLink` so that IDEA's Git plugin hyperlinks JIRA tickets and GitHub PRs.
![image](https://user-images.githubusercontent.com/26535726/115997353-5ecdc600-a615-11eb-99eb-6acbf15d8626.png)
### Why are the changes needed?
Makes the project more friendly for developers who use IDEA.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#32337 from pan3793/SPARK-35223.
Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR makes `Sequence` expression supports ANSI intervals as step expression.
If the start and stop expressions are of `TimestampType`, then the step expression can be either a year-month or a day-time interval.
If the start and stop expressions are of `DateType`, then the step expression must be a year-month interval.
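For illustration, a rough sketch of `sequence` over dates with a year-month step measured in months (day-of-month clamping and descending sequences are ignored here; this is the idea, not Spark's implementation):

```python
from datetime import date

def sequence_dates(start, stop, step_months):
    # Generate start, start + step, ... while the value does not pass stop,
    # stepping by whole months in month-index arithmetic.
    out, cur = [], start
    while cur <= stop:
        out.append(cur)
        total = cur.year * 12 + (cur.month - 1) + step_months
        cur = date(total // 12, total % 12 + 1, cur.day)
    return out
```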
### Why are the changes needed?
Extends the function of `Sequence` expression.
### Does this PR introduce _any_ user-facing change?
Yes. Users can use ANSI intervals as the step expression of `Sequence`.
### How was this patch tested?
New tests.
Closes#32311 from beliefer/SPARK-35088.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.
### Why are the changes needed?
Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in https://github.com/apache/spark/pull/29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.
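The corrected take-the-last-write logic can be sketched as (illustrative, not the optimizer's actual code):

```python
def dedupe_with_fields(field_ops):
    # Sketch of the fix: keep only the *last* value written for each field,
    # but preserve the order in which fields first appeared, so the struct's
    # schema order matches the actual data.
    last, order = {}, []
    for name, value in field_ops:
        if name not in last:
            order.append(name)
        last[name] = value
    return [(name, last[name]) for name in order]
```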
### Does this PR introduce _any_ user-facing change?
It fixes exceptions and incorrect results for valid uses in the latest Spark release.
### How was this patch tested?
Added new unit tests for these edge cases.
Closes#32338 from Kimahriman/bug/optimize-with-fields.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix the unsafe buffer size calculation so the buffer can hold all input fields plus metadata without overflow.
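For context, the fixed-size portion of an `UnsafeRow` is a null-tracking bitset rounded up to whole 8-byte words plus an 8-byte slot per field; a sketch of that calculation (my reading of the layout, not the suite's exact code):

```python
def unsafe_row_fixed_size(num_fields):
    # Null bitset: one bit per field, rounded up to whole 8-byte words.
    bitset_bytes = ((num_fields + 63) // 64) * 8
    # Plus one 8-byte slot per field (the value itself, or an offset-and-length
    # pointing into the variable-length region for non-primitive types).
    return bitset_bytes + 8 * num_fields
```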
### Why are the changes needed?
To make the test suite `MutableProjectionSuite` more stable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```
Closes#32339 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 for push-based shuffle.
Summary of changes:
- Introduce `MergeStatus` which tracks the partition level metadata for a merged shuffle partition in the Spark driver
- Unify `MergeStatus` and `MapStatus` under a single trait to allow code reusing inside `MapOutputTracker`
- Extend `MapOutputTracker` to support registering/unregistering `MergeStatus`, calculating preferred locations for a shuffle with merged shuffle partitions taken into consideration, and serving reducer requests for block fetching locations with merged shuffle partitions.
The added APIs in `MapOutputTracker` will be used by `DAGScheduler` in SPARK-32920 and by `ShuffleBlockFetcherIterator` in SPARK-32922
### Why are the changes needed?
Refer to SPARK-30602
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests.
Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Venkata Sowrirajan <vsowrirajan@linkedin.com>
Closes#30480 from Victsm/SPARK-32921.
Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Columns like 'Shuffle Read Size / Records' and 'Output Size / Records' in the `Aggregated Metrics by Executor` table of the stage-detail page should be sorted in numerical order instead of lexicographical order.
### Why are the changes needed?
Bug fix: the sorting style should be consistent across columns.
The correspondence between the columns and their indexes is shown below (it is defined in stagespage-template.html):
| index | column name |
| ----- | -------------------------------------- |
| 0 | Executor ID |
| 1 | Logs |
| 2 | Address |
| 3 | Task Time |
| 4 | Total Tasks |
| 5 | Failed Tasks |
| 6 | Killed Tasks |
| 7 | Succeeded Tasks |
| 8 | Excluded |
| 9 | Input Size / Records |
| 10 | Output Size / Records |
| 11 | Shuffle Read Size / Records |
| 12 | Shuffle Write Size / Records |
| 13 | Spill (Memory) |
| 14 | Spill (Disk) |
| 15 | Peak JVM Memory OnHeap / OffHeap |
| 16 | Peak Execution Memory OnHeap / OffHeap |
| 17 | Peak Storage Memory OnHeap / OffHeap |
| 18 | Peak Pool Memory Direct / Mapped |
I constructed some data to simulate the sorting results of the index columns from 9 to 18.
As shown below,it can be seen that the sorting results of columns 9-12 are wrong:
![simulate-result](https://user-images.githubusercontent.com/52202080/115120775-c9fa1580-9fe1-11eb-8514-71f29db3a5eb.png)
The reason is that the real data corresponding to columns 9-12 (note that it is not the data displayed on the page) are **all strings similar to `94685/131` (bytes/records), while the real data corresponding to columns 13-18 are all numbers,**
so the sorting of columns 13-18 looks fine, but the results for columns 9-12 are incorrect because the strings are sorted in lexicographical order.
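The fix amounts to sorting on parsed numbers rather than raw strings; a sketch of the idea (the PR itself changes the page's JS, names here are illustrative):

```python
def size_records_sort_key(cell):
    # Parse "94685/131" (bytes/records) into a numeric tuple so the table
    # sorts numerically instead of lexicographically.
    size, records = cell.split("/")
    return (int(size.strip()), int(records.strip()))

cells = ["94685/131", "5/2", "1024/7"]
# Lexicographic order would put "1024/7" before "5/2"; numeric order fixes it.
sorted(cells, key=size_records_sort_key)
```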
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Only JS was modified, and the manual test result works well.
**before modified:**
![looks-illegal](https://user-images.githubusercontent.com/52202080/115120812-06c60c80-9fe2-11eb-9ada-fa520fe43c4e.png)
**after modified:**
![sort-result-corrent](https://user-images.githubusercontent.com/52202080/114865187-7c847980-9e24-11eb-9fbc-39ee224726d6.png)
Closes#32190 from kyoty/aggregated-metrics-by-executor-sorted-incorrectly.
Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
DayTimeIntervalType/YearMonthIntervalType are shown differently between Hive SerDe and row format delimited.
This PR adds a test and opens the discussion.
For this problem I think we have two directions:
1. Leave the behavior as is and add an item explaining it in the migration guide.
2. Since we should not change Hive SerDe's behavior, change Spark's row format delimited behavior to cast DayTimeIntervalType/YearMonthIntervalType with HIVE_STYLE.
### Why are the changes needed?
Add UT
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT.
Closes#32335 from AngersZhuuuu/SPARK-35220.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
```sql
spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1;
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1
Time taken: 2.18 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 1
Time taken: 0.03 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions 200
Time taken: 0.024 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks=2;
21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 2
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 1
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql>
```
`mapred.reduce.tasks` is mapped to `spark.sql.shuffle.partitions` on the write side, but `spark.sql.adaptive.coalescePartitions.initialPartitionNum` might take precedence over `spark.sql.shuffle.partitions` on the read side.
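The intended round-trip can be sketched as (illustrative, not the actual `SetCommand` code):

```python
# A deprecated key always reads through to the single config it maps to,
# regardless of other configs like initialPartitionNum being set.
DEPRECATED_MAP = {"mapred.reduce.tasks": "spark.sql.shuffle.partitions"}

def get_conf(conf, key):
    target = DEPRECATED_MAP.get(key, key)
    return target, conf.get(target)
```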
### Why are the changes needed?
To make `mapred.reduce.tasks` round-trip consistently.
### Does this PR introduce _any_ user-facing change?
Yes, `mapred.reduce.tasks` will always report `spark.sql.shuffle.partitions`, whether `spark.sql.adaptive.coalescePartitions.initialPartitionNum` is set or not.
### How was this patch tested?
a new test
Closes#32265 from yaooqinn/SPARK-35168.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Kafka client to 2.8.0.
Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.
### Why are the changes needed?
This will bring the latest client-side improvement and bug fixes like the following examples.
- KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
- KAFKA-10134 High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
- KAFKA-12193 Re-resolve IPs when a client is disconnected
- KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a known config
- KAFKA-9263 The new hw is added to incorrect log when ReplicaAlterLogDirsThread is replacing log
- KAFKA-10607 Ensure the error counts contains the NONE
- KAFKA-10458 Need a way to update quota for TokenBucket registered with Sensor
- KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition for topic
**RELEASE NOTE**
- https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
- https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the existing tests because this is a dependency change.
Closes#32325 from dongjoon-hyun/SPARK-33913.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Removes the existing aggregator and uses a new aggregator supporting virtual centering.
2. Adds related test suites.
### Why are the changes needed?
Centering vectors should accelerate convergence and produce solutions closer to R's.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated existing test suites and added new ones.
Closes#32124 from zhengruifeng/svc_agg_refactor.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager, and remove some unnecessary code.
### Why are the changes needed?
The number of pending speculative tasks is recomputed in the ExecutorAllocationManager to calculate the maximum number of executors required. However, it only needs to be computed once, which improves performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32306 from weixiuli/SPARK-35200.
Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>