### What changes were proposed in this pull request?
Add a new config that makes the set of configs disabled during cache plan compilation configurable.
### Why are the changes needed?
The configs disabled during cache plan compilation are meant to avoid performance regressions, but not every query runs slower than before when AQE or bucket scan is enabled. It's useful to add a new config so that users can decide which configs should be disabled during cache plan compilation.
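A minimal sketch of how such a configurable disable list could work (hypothetical Python, not Spark's actual implementation; the config names are only illustrative):

```python
# Sketch: build the conf used when compiling a cached plan by forcing a
# user-configurable list of configs off, without touching the session conf.
DEFAULT_DISABLE_LIST = {
    "spark.sql.adaptive.enabled",
    "spark.sql.sources.bucketing.autoBucketedScan.enabled",
}

def conf_for_cached_plan(session_conf, disable_list=DEFAULT_DISABLE_LIST):
    conf = dict(session_conf)  # copy; never mutate the caller's conf
    for key in disable_list:
        conf[key] = "false"
    return conf

session = {"spark.sql.adaptive.enabled": "true", "spark.app.name": "demo"}
cached = conf_for_cached_plan(session)
```

Making `disable_list` itself a config is the point of the change: users who know AQE helps their cached plans can shrink the list.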
### Does this PR introduce _any_ user-facing change?
Yes, a new config.
### How was this patch tested?
Add test.
Closes #32482 from ulysses-you/SPARK-35332.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add New SQL functions:
* TRY_ADD
* TRY_DIVIDE
These expressions are identical to the following expressions under ANSI mode, except that they return NULL if an error occurs:
* ADD
* DIVIDE
Note: it is easy to add other expressions like `TRY_SUBTRACT`/`TRY_MULTIPLY` but let's control the number of these new expressions and just add `TRY_ADD` and `TRY_DIVIDE` for now.
### Why are the changes needed?
1. Users can finish queries without interruption in ANSI mode.
2. Users can get NULLs instead of unreasonable results if overflow occurs when ANSI mode is off.
For example, the behavior of the following SQL operations is unreasonable:
```
2147483647 + 2 => -2147483647
```
With the new safe version SQL functions:
```
TRY_ADD(2147483647, 2) => null
```
Note: **We should only add new expressions to important operators, instead of adding new safe expressions for all the expressions that can throw errors.**
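The intended NULL-on-error semantics can be sketched in plain Python (a simplified model of 32-bit integer overflow and division by zero, not the actual Catalyst expressions):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def try_add(a, b):
    """Return None (SQL NULL) instead of overflowing 32-bit addition."""
    if a is None or b is None:
        return None
    result = a + b
    return result if INT_MIN <= result <= INT_MAX else None

def try_divide(a, b):
    """Return None (SQL NULL) instead of raising on division by zero."""
    if a is None or b is None or b == 0:
        return None
    return a / b
```

With this model, `try_add(2147483647, 2)` yields None rather than the wrapped-around `-2147483647`.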
### Does this PR introduce _any_ user-facing change?
Yes, new SQL functions: TRY_ADD/TRY_DIVIDE
### How was this patch tested?
Unit test
Closes #32292 from gengliangwang/try_add.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
`./build/mvn` now downloads the .sha512 checksum of Maven artifacts it downloads, and checks the checksum after download.
### Why are the changes needed?
This ensures the integrity of the Maven artifact during a user's build, which may come from several non-ASF mirrors.
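The verification step can be sketched in Python (a simplified model of what the shell script does, not the script itself; `.sha512` files may contain either a bare digest or `digest  filename`):

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """Hex SHA-512 digest of raw artifact bytes."""
    return hashlib.sha512(data).hexdigest()

def checksum_matches(data: bytes, sha512_file_text: str) -> bool:
    """Check artifact bytes against the text of a downloaded .sha512 file."""
    expected = sha512_file_text.split()[0].strip().lower()
    return sha512_hex(data) == expected
```

On mismatch, the build script aborts rather than using the possibly corrupted or tampered artifact.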
### Does this PR introduce _any_ user-facing change?
Should not affect anything about Spark per se, just the build.
### How was this patch tested?
Manual testing wherein I forced Maven/Scala download, verified checksums are downloaded and checked, and verified it fails on error with a corrupted checksum.
Closes #32505 from srowen/SPARK-35373.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR removes the check of `summary.logLikelihood` in ml/clustering.py - this GMM test is quite flaky. It fails easily, e.g., if we:
- change the number of partitions;
- change the way the sum of weights is computed;
- change the underlying BLAS implementation.
It also uses a more permissive precision in the `Word2Vec` test case.
### Why are the changes needed?
To recover the build and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test cases.
Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We have supported DPP in AQE when the join is a broadcast hash join before applying the AQE rules in [SPARK-34168](https://issues.apache.org/jira/browse/SPARK-34168), which has some limitations: DPP is only applied when the small table side is executed first, so that the big table side can reuse its broadcast exchange. This PR addresses that limitation and applies DPP whenever the broadcast exchange can be reused.
### Why are the changes needed?
Resolve the limitations when both DPP and AQE are enabled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added new unit tests.
Closes #31756 from JkSelf/supportDPP2.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from the ANSI standard or any other mainstream database, so we need to define the behavior on our own.
The behavior today is very weird: assume the source table has `n1` columns and the target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from the source & target tables and pairing them by ordinal.
This PR proposes a more reasonable behavior: take all the columns from target table as keys, and find the corresponding columns from source table by name as values.
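The two matching strategies can be sketched as follows (hypothetical Python; the column names are illustrative):

```python
def assignments_by_ordinal(target_cols, source_cols):
    """Old behavior: pair the first min(n1, n2) columns by position."""
    n = min(len(target_cols), len(source_cols))
    return list(zip(target_cols[:n], source_cols[:n]))

def assignments_by_name(target_cols, source_cols):
    """New behavior: every target column is a key; its value is the
    source column with the same name (an error if no such column)."""
    by_name = {c: c for c in source_cols}
    return [(t, by_name[t]) for t in target_cols]

target = ["id", "name", "age"]
source = ["age", "id", "name"]  # same columns, different order
```

By-name matching keeps working when the source schema is reordered or extended, which is what makes schema evolution easier.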
### Why are the changes needed?
Fix MERGE INSERT/UPDATE * to be more user-friendly and to make schema evolution easier.
### Does this PR introduce _any_ user-facing change?
Yes, but MERGE is only supported by very few data sources.
### How was this patch tested?
new tests
Closes #32192 from cloud-fan/merge.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In https://github.com/yaooqinn/itachi/issues/8, we had a discussion about the current extension injection for the Spark session. We've agreed that the current way is not that convenient for either third-party developers or end-users.
It's much simpler if third-party developers can provide a resource file that contains default extensions for Spark to load ahead of time.
### Why are the changes needed?
Better user experience.
### Does this PR introduce _any_ user-facing change?
No, dev-only change.
### How was this patch tested?
new tests
Closes #32515 from yaooqinn/SPARK-35380.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR aims to unify two K8s version variables in two `pom.xml`s into one. `kubernetes-client.version` is correct because the artifact ID is `kubernetes-client`.
```
kubernetes.client.version (kubernetes/core module)
kubernetes-client.version (kubernetes/integration-test module)
```
### Why are the changes needed?
Having two variables for the same value is confusing and inconvenient when we upgrade K8s versions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs. (The compilation test passes are enough.)
Closes #32531 from dongjoon-hyun/SPARK-35394.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes the same issue as #32424.
```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
    transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: struct(number.alias("n"), letter.alias("l"))
        )
    )
).alias("zipped")).show(truncate=False)
```
**Before:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
**After:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```
### Why are the changes needed?
To produce the correct results.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the results to be correct as mentioned above.
### How was this patch tested?
Added a unit test as well as manually.
Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change `map` in `InvokeLike.invoke` to a while loop to improve performance, following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
`InvokeLike.invoke`, which is used in non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments:
```scala
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
  // return null if one of arguments is null
  null
} else {
  ...
```
which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.
<img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png">
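The difference between the two evaluation strategies can be modeled in plain Python (a sketch of the idea only; the real change is in Scala, where the cost is the intermediate collection plus a second scan for nulls):

```python
def eval_args_with_map(arguments, input_row):
    """map-style: materialize the full list, then scan it again for nulls."""
    args = [expr(input_row) for expr in arguments]
    if any(a is None for a in args):
        return None
    return args

def eval_args_with_loop(arguments, input_row):
    """loop-style: one pass, short-circuiting on the first null."""
    args = [None] * len(arguments)
    i = 0
    while i < len(arguments):
        v = arguments[i](input_row)
        if v is None:
            return None  # bail out early; no second scan needed
        args[i] = v
        i += 1
    return args
```

For trivial methods the per-row overhead of the first version dominates, which is what the benchmark below measures.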
Benchmark results from `V2FunctionBenchmark` show up to a 3x improvement:
Before
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 36506 36656 251 13.7 73.0 1.0X
java_long_add_default 47151 47540 370 10.6 94.3 0.8X
java_long_add_magic 178691 182457 1327 2.8 357.4 0.2X
java_long_add_static_magic 177151 178258 1151 2.8 354.3 0.2X
```
After
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 29897 30342 568 16.7 59.8 1.0X
java_long_add_default 40628 41075 664 12.3 81.3 0.7X
java_long_add_magic 54553 54755 182 9.2 109.1 0.5X
java_long_add_static_magic 55410 55532 127 9.0 110.8 0.5X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #32527 from sunchao/SPARK-35384.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR allows the PR source branch to include slashes.
### Why are the changes needed?
There are PRs whose source branches include slashes, like `issues/SPARK-35119/gha` here or #32523.
Before the fix, the PR build fails in `Sync the current branch with the latest in Apache Spark` phase.
For example, at #32523, the source branch is `issues/SPARK-35382/nested_higher_order_functions`:
```
...
fatal: couldn't find remote ref nested_higher_order_functions
Error: Process completed with exit code 128.
```
(https://github.com/ueshin/apache-spark/runs/2569356241)
### Does this PR introduce _any_ user-facing change?
No, this is a dev-only change.
### How was this patch tested?
This PR source branch includes slashes and #32525 doesn't.
Closes #32524 from ueshin/issues/SPARK-35119/gha.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost the same ones; the only differences in these queries are ORDER BY columns.
### Why are the changes needed?
To improve test performance.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Existing tests.
Closes #32520 from maropu/SkipDupQueries.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This proposes to document the available metrics for ExecutorAllocationManager in the Spark monitoring documentation.
### Why are the changes needed?
The ExecutorAllocationManager is instrumented with metrics using the Spark metrics system.
The relevant work is in SPARK-7007 and SPARK-33763
ExecutorAllocationManager metrics are currently undocumented.
### Does this PR introduce _any_ user-facing change?
This PR adds documentation only.
### How was this patch tested?
N/A
Closes #32500 from LucaCanali/followupMetricsDocSPARK33763.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently Spark does not allow setting spark.driver.memory, spark.executor.cores, or spark.executor.memory to 0, but it allows driver cores to be 0. This PR adds the same check for driver cores. Thanks to Oleg Lypkan for finding this.
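The added check can be sketched as (hypothetical Python, a simplified model of the validation rather than Spark's actual code):

```python
def validate_cores(conf):
    """Reject non-positive core counts; previously spark.driver.cores
    slipped through this kind of check while the others did not."""
    for key in ("spark.driver.cores", "spark.executor.cores"):
        if key in conf and int(conf[key]) <= 0:
            raise ValueError(f"{key} must be a positive number")
```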
### Why are the changes needed?
To make the configuration check consistent.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual testing
Closes #32504 from shahidki31/shahid/drivercore.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Switch to plain `while` loop following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
A `while` loop may yield better performance compared to `foreach`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #32522 from sunchao/SPARK-35361-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to improve S3A magic committer support by inferring all missing configs from a single minimum configuration, `spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true`.
Given that AWS S3 has provided [strong read-after-write consistency](https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/) since December 2020, we can ignore DynamoDB-related configurations. As a result, the minimum set of configurations is the following:
```
spark.hadoop.fs.s3a.committer.magic.enabled=true
spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true
spark.hadoop.fs.s3a.committer.name=magic
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
```
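The inference can be sketched as follows (hypothetical Python; `setdefault` captures the rule that explicit user settings are never overridden):

```python
# Default configs implied once any per-bucket magic committer flag is on.
MAGIC_COMMITTER_DEFAULTS = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}

def infer_magic_committer_confs(conf):
    """If a per-bucket magic flag is enabled, fill in the missing
    committer configs without overriding anything the user set."""
    enabled = any(
        k.startswith("spark.hadoop.fs.s3a.bucket.")
        and k.endswith(".committer.magic.enabled")
        and v == "true"
        for k, v in conf.items()
    )
    out = dict(conf)
    if enabled:
        for k, v in MAGIC_COMMITTER_DEFAULTS.items():
            out.setdefault(k, v)
    return out
```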
### Why are the changes needed?
To use the S3A magic committer in Apache Spark, users need to set up a set of configurations. And, if something is missing, it will end up with error messages like the following.
```
Exception in thread "main" org.apache.hadoop.fs.s3a.commit.PathCommitException:
`s3a://my-spark-bucket`: Filesystem does not have support for 'magic' committer enabled in configuration option fs.s3a.committer.magic.enabled
at org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
at org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
```
### Does this PR introduce _any_ user-facing change?
Yes, after this improvement, Spark users can enable the S3A magic committer with a single configuration.
```
spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true
```
This PR only infers the missing configurations, so there is no side effect for existing users who already have all the configurations set.
### How was this patch tested?
Pass the CIs with the newly added test cases.
Closes #32518 from dongjoon-hyun/SPARK-35383.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
After merging https://github.com/apache/spark/pull/32439, there is a flaky error from the GitHub Actions job "Java 11 build with Maven":
```
Error: ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
```
We can resolve it by increasing the JVM thread stack size to 256MB. The container for GitHub Actions jobs has 7GB of memory, so this should be fine.
### Why are the changes needed?
Fix flaky test failure in Java 11 build test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Github action test
Closes #32521 from gengliangwang/increaseStackSize.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
A simple follow-up of #32474 to throw exception instead of sys.error.
### Why are the changes needed?
Throwing an exception only fails the query, unlike `sys.error`.
### Does this PR introduce _any_ user-facing change?
Yes, if `Invoke` or `StaticInvoke` cannot find the method, it now throws an exception instead of the original `sys.error`.
### How was this patch tested?
Existing tests.
Closes #32519 from viirya/SPARK-35347-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
The major changes of this upgrade are as follows:
- Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
- Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
- Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
- Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).
For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html
NOTE1: I've checked that there is no obvious performance regression. For all the data, see: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing
NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit was reverted in #29495 because of a correctness issue. Recently, #32374 checked whether Spark could land on v3.1.3, but a new bug was found there. These known issues have been fixed in v3.1.4 by the following PRs:
- janino-compiler/janino#145
- janino-compiler/janino#146
### Why are the changes needed?
janino v3.0.X is no longer maintained.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed.
Closes #32455 from maropu/janino_v3.1.4.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Overload methods `PageRank.runWithOptions` and `PageRank.runWithOptionsWithPreviousPageRank` (not to break any user-facing signature) with a `normalized` parameter that describes "whether or not to normalize the rank sum".
### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35357
When dealing with a non-negligible proportion of sinks in a graph, algorithms based on incremental updates of ranks can get a **precision gain for free** if they are allowed to manipulate non-normalized ranks.
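The normalization step itself can be sketched in Python (a simplified model; whether the target sum is 1 or the vertex count is an implementation detail of the real algorithm):

```python
def normalize_ranks(ranks):
    """Rescale ranks so they sum to 1, restoring the probability mass
    that leaks out at sink vertices (vertices with no out-edges)."""
    total = sum(ranks.values())
    return {v: r / total for v, r in ranks.items()}
```

Deferring this rescaling until the end, as the `normalized` flag allows, avoids repeatedly dividing and re-multiplying intermediate ranks across resumed runs.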
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By adding a unit test that verifies that (even when dealing with a graph containing a sink) we end up with the same result for both these scenarios:
a)
- Run **6 iterations** of pagerank in a row using `PageRank.runWithOptions` with **normalization enabled**
b)
- Run **2 iterations** using `PageRank.runWithOptions` with **normalization disabled**
- Resume from the `preRankGraph1` and run **2 more iterations** using `PageRank.runWithOptionsWithPreviousPageRank` with **normalization disabled**
- Finally resume from the `preRankGraph2` and run **2 more iterations** using `PageRank.runWithOptionsWithPreviousPageRank` with **normalization enabled**
Closes #32485 from bonnal-enzo/make-pagerank-normalization-optional.
Authored-by: Enzo Bonnal <enzobonnal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Follow-up of the discussion at https://github.com/apache/spark/pull/25854#discussion_r629451135.
### Why are the changes needed?
Code cleanup.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes #32499 from AngersZhuuuu/SPARK-29145-fix.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add columnar execution support for the ANSI interval types YearMonthIntervalType and DayTimeIntervalType.
### Why are the changes needed?
To support caching tables with ANSI interval types.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
run ./dev/lint-java
run ./dev/scalastyle
run test: CachedTableSuite
run test: ColumnTypeSuite
Closes #32452 from Peng-Lei/SPARK-35243.
Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: Lei Peng <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the same issue as https://github.com/apache/spark/pull/32424
```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```
**Before:**
```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```
**After:**
```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```
### Why are the changes needed?
To produce the correct results.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the results to be correct as mentioned above.
### How was this patch tested?
Manually tested as above, and unit test was added.
Closes #32517 from HyukjinKwon/SPARK-35381.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
SPARK-35175 (#32274) added a linter for JS so let's add it to GA.
### Why are the changes needed?
To keep the JS code clean.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes #32512 from sarutak/ga-lintjs.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row.
### Why are the changes needed?
When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below:
<img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png">
By moving it out of the loop, I'm seeing up to a 2x speedup in `V2FunctionBenchmark`. For instance:
Before:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 32437 32896 434 15.4 64.9 1.0X
java_long_add_default 85675 97045 NaN 5.8 171.3 0.4X
```
After:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 30182 30387 279 16.6 60.4 1.0X
java_long_add_default 42862 43009 209 11.7 85.7 0.7X
```
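The hoisting can be modeled in Python (a sketch of the idea only; the real fix is in Scala, where `zipWithIndex` allocates a fresh tuple collection on every row):

```python
def make_evaluator(arguments):
    """Hoist the (index, expr) pairing out of the per-row loop:
    it is computed once when the evaluator is built, not per row."""
    indexed = tuple(enumerate(arguments))  # built a single time
    def evaluate(row):
        result = [None] * len(arguments)
        for i, expr in indexed:  # per-row work only evaluates expressions
            result[i] = expr(row)
        return result
    return evaluate
```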
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #32507 from sunchao/SPARK-35361.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As of a few hours ago, the Python linter fails in GA.
The latest Jinja2 3.0.0 release seems to cause this failure.
https://pypi.org/project/Jinja2/
```
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks passed.
starting mypy test...
mypy checks passed.
starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst
Exception occurred:
File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
re-running make html to print full warning list:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst
Exception occurred:
File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
```
### Why are the changes needed?
To recover GA build.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #32509 from sarutak/fix-python-lint-error.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR increases the stack size for Scala compilation in Maven build to fix the error:
```
java.lang.StackOverflowError
scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
scala.reflect.internal.Trees.itransform(Trees.scala:1404)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
scala.reflect.internal.Trees.itransform(Trees.scala:1383)
```
See https://github.com/apache/spark/runs/2554067779
### Why are the changes needed?
To recover JDK 11 compilation
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
CI in this PR will test it out.
Closes #32502 from HyukjinKwon/SPARK-35372.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to introduce three new configurations to limit the maximum number of jobs/stages/executors on the timeline view.
### Why are the changes needed?
If the number of items on the timeline view grows beyond about 1,000, rendering can be significantly slow.
https://issues.apache.org/jira/browse/SPARK-35229
The maximum number of tasks on the timeline is already limited by `spark.ui.timeline.tasks.maximum`, so I propose to mitigate this issue in the same manner.
### Does this PR introduce _any_ user-facing change?
Yes, the maximum number of items shown on the timeline view is limited.
I propose default values of 500 for jobs and stages, and 250 for executors.
Since an executor has at most 2 items (added and removed), 250 is chosen.
### How was this patch tested?
I manually confirmed this change works with the following procedure.
```
# launch a cluster
$ bin/spark-shell --conf spark.ui.retainedDeadExecutors=300 --master "local-cluster[4, 1, 1024]"
// Confirm the maximum number of jobs
(1 to 1000).foreach { _ => sc.parallelize(List(1)).collect }
// Confirm the maximum number of stages
var df = sc.parallelize(1 to 2)
(1 to 1000).foreach { i => df = df.repartition(i % 5 + 1) }
df.collect
// Confirm the maximum number of executors
(1 to 300).foreach { _ => try sc.parallelize(List(1)).foreach { _ => System.exit(0) } catch { case e => }}
```
Screenshots here.
![jobs_limited](https://user-images.githubusercontent.com/4736016/116386937-3e8c4a00-a855-11eb-8f4c-151cf7ddd3b8.png)
![stages_limited](https://user-images.githubusercontent.com/4736016/116386990-49df7580-a855-11eb-9f71-8e129e3336ab.png)
![executors_limited](https://user-images.githubusercontent.com/4736016/116387009-4f3cc000-a855-11eb-8697-a2eb4c9c99e6.png)
Closes #32381 from sarutak/mitigate-timeline-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- BOOL_AGG
- COUNT_IF
- CURRENT_LIKE
- RUNTIME_REPLACEABLE
Added tree traversal pruning to the following rules:
- ReplaceExpressions
- RewriteNonCorrelatedExists
- ComputeCurrentTime
- GetCurrentDatabaseAndCatalog
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
Performance improvement (org.apache.spark.sql.TPCDSQuerySuite):
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
--- | --- | --- | ---
ReplaceExpressions | 27546369 | 19753804 | 0.72
RewriteNonCorrelatedExists | 17304883 | 2086194 | 0.12
ComputeCurrentTime | 35751301 | 19984477 | 0.56
GetCurrentDatabaseAndCatalog | 37230787 | 18874013 | 0.51
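The pruning idea can be sketched roughly as follows; this is an illustrative model, not Catalyst's actual `TreeNode` implementation: each node records the union of tree patterns present in its subtree, and a rule descends into a subtree only if it can possibly contain a matching node.

```python
class Node:
    """A tiny tree node tagged with a pattern; an illustrative model only."""
    def __init__(self, pattern, children=()):
        self.pattern = pattern
        self.children = list(children)
        # Union of all patterns in this subtree, computed once bottom-up.
        self.subtree_patterns = {pattern}.union(
            *(c.subtree_patterns for c in self.children))

def transform_with_pruning(node, target_pattern, rule, visits):
    # Skip entire subtrees that cannot contain the target pattern.
    if target_pattern not in node.subtree_patterns:
        return node
    visits.append(node.pattern)
    node.children = [transform_with_pruning(c, target_pattern, rule, visits)
                     for c in node.children]
    return rule(node)

tree = Node("PROJECT", [Node("FILTER", [Node("CURRENT_LIKE")]), Node("SCAN")])
visits = []
transform_with_pruning(tree, "CURRENT_LIKE", lambda n: n, visits)
# The SCAN subtree is skipped entirely: it cannot contain CURRENT_LIKE.
```

Rules like `ComputeCurrentTime` only care about a few patterns, so most subtrees are skipped without being visited, which is where the measured speedups come from.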
### How was this patch tested?
Existing tests.
Closes #32461 from sigmod/finish_analysis.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
This is a pre-requisite of https://github.com/apache/spark/pull/32476, per the discussion at https://github.com/apache/spark/pull/32476#issuecomment-836469779 . It refactors sort merge join code-gen to use streamed/buffered terminology, which makes the code-gen agnostic to join type and allows it to be extended to join types other than inner join.
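For intuition, the streamed/buffered structure of a (non-code-gen) inner sort merge join can be sketched as follows; the names and row representation are illustrative only, not the generated code:

```python
def sort_merge_inner_join(streamed, buffered, key):
    """Inner merge join over two key-sorted lists of dict rows.

    The streamed side is scanned once; for each streamed row, the group
    of buffered rows sharing its key is located and joined. A sketch of
    the streamed/buffered structure only.
    """
    out, j = [], 0
    for srow in streamed:
        k = srow[key]
        # Advance the buffered side past smaller keys.
        while j < len(buffered) and buffered[j][key] < k:
            j += 1
        # Join with every buffered row in the matching key group.
        g = j
        while g < len(buffered) and buffered[g][key] == k:
            out.append({**srow, **buffered[g]})
            g += 1
    return out

left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 2, "b": "p"}, {"k": 2, "b": "q"}, {"k": 3, "b": "r"}]
rows = sort_merge_inner_join(left, right, "k")
```

The point of the terminology is that other join types (e.g. outer joins) differ only in what is emitted for unmatched streamed or buffered rows, not in this scan structure.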
### Why are the changes needed?
Pre-requisite of https://github.com/apache/spark/pull/32476.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test in `InnerJoinSuite.scala` for inner join code-gen.
Closes #32495 from c21/smj-refactor.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The Sequence expression outputs a confusing error message.
This PR fixes the issue.
### Why are the changes needed?
Improve the error message for the Sequence expression.
### Does this PR introduce _any_ user-facing change?
Yes. This PR updates the error message of the Sequence expression.
### How was this patch tested?
Tests updated.
Closes #32492 from beliefer/SPARK-35088-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades Kubernetes and Minikube version for integration tests and removes/updates the old code for this new version.
Details of this changes:
- As [discussed in the mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html): updating Minikube version from v0.34.1 to v1.7.3 and kubernetes version from v1.15.12 to v1.17.3.
- checking the Minikube version and failing with an explanation when the tests are started on a version < v1.7.3.
- removing minikube status checking code related to old Minikube versions
- in the Minikube backend, using fabric8's `Config.autoConfigure()` method to configure the kubernetes client to use the `minikube` k8s context (as done in [one of the kubernetes-client examples](https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/kubectl/equivalents/ConfigUseContext.java#L36))
- Introducing a `persistentVolume` test tag: this would be a temporary change to skip PVC tests in the Kubernetes integration test, as currently the PVC tests are blocking the move to Docker as Minikube's driver (for details please check https://issues.apache.org/jira/browse/SPARK-34738).
### Why are the changes needed?
With the currently suggested versions, one can run into several problems without noticing that the Minikube/Kubernetes version is the cause.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It was tested on Mac with [this script](https://gist.github.com/attilapiros/cd58a16bdde833c80c5803c337fffa94#file-check_minikube_versions-zsh) which installs each Minikube versions from v1.7.2 (including this version to test the negative case of the version check) and runs the integration tests.
It was started with:
```
./check_minikube_versions.zsh > test_log 2>&1
```
There was only one build failure; the rest were successful:
```
$ grep "BUILD SUCCESS" test_log | wc -l
26
$ grep "BUILD FAILURE" test_log | wc -l
1
```
It was for Minikube v1.7.2 and the log is:
```
KubernetesSuite:
*** RUN ABORTED ***
java.lang.AssertionError: assertion failed: Unsupported Minikube version is detected: minikube version: v1.7.2.For integration testing Minikube version 1.7.3 or greater is expected.
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:52)
at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:43)
at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273)
at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271)
...
```
Moreover, I also tested with multiple k8s cluster contexts.
Closes #31829 from attilapiros/SPARK-34736.
Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
### What changes were proposed in this pull request?
Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```
### Why are the changes needed?
For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) with the latest master, the following error occurs:
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150
In the previous release, the function was
```
static public scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After https://github.com/apache/spark/pull/31349, the function becomes:
```
static public scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```
This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?
Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with previous release again.
### How was this patch tested?
Existing unit tests
Closes #32493 from gengliangwang/typecoercion.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
Make RepairTableCommand respect `spark.sql.addPartitionInBatch.size` too.
### Why are the changes needed?
Make the batch size RepairTableCommand uses when adding partitions configurable.
### Does this PR introduce _any_ user-facing change?
Yes. Users can use `spark.sql.addPartitionInBatch.size` to change the batch size when repairing a table.
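The batching idea can be sketched as follows; the helper name is illustrative, and in the real command each batch corresponds to one `ALTER TABLE ... ADD PARTITION` metastore call:

```python
def add_partitions_in_batches(partitions, batch_size):
    """Group discovered partitions into batches of at most `batch_size`.

    A sketch of the batching idea only; the real RepairTableCommand
    issues one metastore call per batch.
    """
    for i in range(0, len(partitions), batch_size):
        yield partitions[i:i + batch_size]

parts = [f"dt=2021-05-{d:02d}" for d in range(1, 11)]   # 10 partitions
batches = list(add_partitions_in_batches(parts, 3))      # sizes 3, 3, 3, 1
```

A smaller batch size trades more metastore round trips for smaller individual calls, which is why making it configurable matters for tables with many partitions.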
### How was this patch tested?
Not needed.
Closes #32489 from AngersZhuuuu/SPARK-35360.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds `python/.idea` to the Git ignore file. PyCharm is supposed to be opened against the `python` directory, which contains the `pyspark` package as its root package.
This was caused by https://github.com/apache/spark/pull/32337.
### Why are the changes needed?
To ignore `.idea` file for PyCharm.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested by testing with `git` command.
Closes #32490 from HyukjinKwon/minor-python-gitignore.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to increase the maximum heap memory setting for release build.
### Why are the changes needed?
When I was cutting RCs for 2.4.8, I frequently encountered OOMs when building with mvn. They happened many times until I increased the heap memory setting.
I am not sure if other release managers encounter the same issue. So I propose to increase the heap memory setting and see if it looks good for others.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Manually used it during cutting RCs of 2.4.8.
Closes #32487 from viirya/release-mvn-oom.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change `failOnError` to false for `NativeAdd` in `V2FunctionBenchmark`.
### Why are the changes needed?
Since `NativeAdd` simply does addition on longs, it's better to set `failOnError` to false so it uses native long addition instead of `Math.addExact`.
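The two addition modes can be sketched in plain arithmetic terms; this is an illustrative model of 64-bit long addition, not the `NativeAdd` code itself:

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def add_long(a, b, fail_on_error):
    """64-bit signed addition: raise on overflow (like Math.addExact)
    when fail_on_error, otherwise wrap around (like native long '+').

    An illustrative model only.
    """
    r = a + b
    if r < LONG_MIN or r > LONG_MAX:
        if fail_on_error:
            raise OverflowError("long overflow")
        # Wrap to 64-bit two's complement.
        r = (r + 2**63) % 2**64 - 2**63
    return r
```

The overflow check in the `fail_on_error` path is exactly the extra work the benchmark avoids by setting `failOnError` to false.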
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #32481 from sunchao/SPARK-35261-follow-up.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename pattern strings and regexps of year-month and day-time intervals.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes #32444 from AngersZhuuuu/SPARK-35111-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
As title. We should use the more restrictive interface `ShuffledJoin` rather than `BaseJoinExec` in `CoalesceBucketsInJoin`, as the rule only applies to sort merge join and shuffled hash join (i.e. `ShuffledJoin`).
### Why are the changes needed?
Code cleanup.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test in `CoalesceBucketsInJoinSuite`.
Closes #32480 from c21/minor-cleanup.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
When `numSlices` is available, `logical.Range` should compute an exact `maxRowsPerPartition`.
### Why are the changes needed?
`maxRowsPerPartition` is used in the optimizer, so we should provide an exact value when possible.
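Assuming the rows are split into contiguous chunks whose sizes differ by at most one, the exact bound is a ceiling division; a minimal sketch (not Spark's actual `Range` code):

```python
import math

def max_rows_per_partition(num_elements, num_slices):
    """Exact upper bound on rows in any partition when num_elements rows
    are split over num_slices partitions with sizes differing by at most
    one. A sketch of the arithmetic only.
    """
    return math.ceil(num_elements / num_slices)
```

For example, 10 rows over 3 slices gives partitions of sizes 4, 3, 3, so the exact bound is 4 rather than the whole row count.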
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes #32350 from zhengruifeng/range_maxRowsPerPartition.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch proposes to use `MethodUtils` for looking up methods in `Invoke` and `StaticInvoke` expressions.
### Why are the changes needed?
Currently we have hand-written logic in `Invoke` and `StaticInvoke` expressions for looking up methods. It is tricky to consider all the cases, and there is already an existing utility package for this purpose. We should reuse it.
### Does this PR introduce _any_ user-facing change?
No, internal change only.
### How was this patch tested?
Existing tests.
Closes #32474 from viirya/invoke-util.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to filter out TPCDS v1.4 q6 and q75 in `TPCDSQueryTestSuite`.
I saw `TPCDSQueryTestSuite` fail nondeterministically because output row orders were different from those in the golden files. For example, the failure in the GA job, https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:
a0c76a8755/sql/core/src/test/resources/tpcds/q6.sql (L20)
Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same and the only difference is that `tpcds-v2.7.0/q6.sql` sorts both `cnt` and `a.ca_state`:
a0c76a8755/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql (L22)
So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has the same issue).
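The flakiness can be illustrated with a tiny example: when rows tie on the single sort key, their relative order is engine-dependent, while a secondary key pins the order down. The data here is made up for illustration:

```python
rows = [("TN", 5), ("GA", 5), ("TX", 7)]

# Sorting only by cnt leaves the relative order of the two ties
# ("TN" vs "GA") up to the engine, so golden-file comparisons can flake.
by_cnt = sorted(rows, key=lambda r: r[1])

# Adding ca_state as a tie-breaker (as tpcds-v2.7.0/q6.sql does)
# yields a single deterministic order.
deterministic = sorted(rows, key=lambda r: (r[1], r[0]))
```

This is why testing only the fully sorted `tpcds-v2.7.0` variants of q6 and q75 makes the suite stable.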
### Why are the changes needed?
For stable testing.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GA passed.
Closes #32454 from maropu/CleanUpTpcdsQueries.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR makes the below case work well.
```sql
select a b from values(1) t(a) distribute by a;
```
```
== Parsed Logical Plan ==
'RepartitionByExpression ['a]
+- 'Project ['a AS b#42]
+- 'SubqueryAlias t
+- 'UnresolvedInlineTable [a], [List(1)]
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62;
'RepartitionByExpression ['a]
+- Project [a#48 AS b#42]
+- SubqueryAlias t
+- LocalRelation [a#48]
```
### Why are the changes needed?
bugfix
### Does this PR introduce _any_ user-facing change?
Yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */`.
### How was this patch tested?
new tests
Closes #32465 from yaooqinn/SPARK-35331.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Retain column metadata during the process of nested column pruning, when constructing `StructField`.
To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail.
### Why are the changes needed?
The column metadata is used in a few places such as re-constructing CHAR/VARCHAR information such as in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning.
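The key point of the fix can be modeled in a small sketch; the `StructField` model and `prune` helper here are simplified stand-ins, not Spark's actual schema-pruning code (the metadata key mirrors the CHAR/VARCHAR one):

```python
from dataclasses import dataclass, field

@dataclass
class StructField:
    """Simplified model: data_type is either a leaf type name (str)
    or a list of child StructFields for a struct."""
    name: str
    data_type: object
    metadata: dict = field(default_factory=dict)

def prune(sf, wanted):
    """Keep only the child fields named in `wanted`, retaining each
    field's metadata when rebuilding the StructField -- the point of
    the fix. A simplified sketch, not Spark's SchemaPruning code."""
    children = [c for c in sf.data_type if c.name in wanted]
    return StructField(sf.name, children, metadata=sf.metadata)

person = StructField("person", [
    StructField("name", "string", {"__CHAR_VARCHAR_TYPE_STRING": "varchar(10)"}),
    StructField("age", "int"),
])
pruned = prune(person, {"name"})
```

If pruning rebuilt the fields without carrying the metadata over, the CHAR/VARCHAR length information would be silently lost, which is the class of bug `DSV2CharVarcharDDLTestSuite` catches.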
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #32354 from sunchao/SPARK-35232.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>