### What changes were proposed in this pull request?
This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.
### Why are the changes needed?
Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33971 from dongjoon-hyun/SPARK-36732.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c217797297)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.
### Why are the changes needed?
Scala 2.12.15 improves compatibility with JDK 17 and 18:
https://github.com/scala/scala/releases/tag/v2.12.15
- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer
### Does this PR introduce _any_ user-facing change?
Yes, this is a Scala version change.
### How was this patch tested?
Pass the CIs
Closes #33999 from dongjoon-hyun/SPARK-36759.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 16f1f71ba5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Upgrade Apache Parquet to 1.12.1
### Why are the changes needed?
Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same
In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests + a new test for the issue in SPARK-36696
Closes #33969 from sunchao/upgrade-parquet-12.1.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit a927b0836b)
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR aims to fix the regex to avoid breaking `pom.xml`.
### Why are the changes needed?
**BEFORE**
```
$ dev/change-scala-version.sh 2.12
$ git diff | head -n10
diff --git a/core/pom.xml b/core/pom.xml
index dbde22f2bf..6ed368353b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
</properties>
<dependencies>
- <!--<!--
```
**AFTER**
Since the default Scala version is `2.12`, the following `no-op` is the correct behavior which is consistent with the previous behavior.
```
$ dev/change-scala-version.sh 2.12
$ git diff
```
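The actual sed expressions in `change-scala-version.sh` are not reproduced here; as a hedged illustration of the failure mode, the Python sketch below (with hypothetical patterns) shows why an unanchored substitution corrupts an already-commented line, while a guarded one stays a no-op:

```python
import re

pom_line = "<artifactId>scala-parallel-collections_2.13</artifactId>"
commented = "<!--" + pom_line + "-->"

# Unanchored substitution: matches even inside an already-commented line,
# so a second run nests the markers ("<!--<!--"), as in the BEFORE diff.
naive = re.sub(r"(<artifactId>.*</artifactId>)", r"<!--\1-->", commented)
assert naive.startswith("<!--<!--")

# Anchored substitution with a guard: it skips lines that already start
# with a comment marker, so re-running the script changes nothing.
safe = re.sub(r"^(?!<!--)(<artifactId>.*</artifactId>)$", r"<!--\1-->", commented)
assert safe == commented
```

The guarded form is what makes the script idempotent, which is the behavior the AFTER section demonstrates.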
### Does this PR introduce _any_ user-facing change?
No. This is a dev only change.
### How was this patch tested?
Manually.
Closes #33996 from dongjoon-hyun/SPARK-36712.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d730ef24fe)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
As [reported on `devspark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.
### What changes were proposed in this pull request?
This PR suggests working around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.
I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
- removed OSGi metadata
- renamed some internal inner classes
- added `Automatic-Module-Name`
### Why are the changes needed?
According to the posts, this solves issues for developers that write unit tests for their applications.
Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Locally
Closes #33948 from lrytz/parCollDep.
Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 1a62e6a2c1)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `aircompressor` dependency from 1.19 to 1.21.
### Why are the changes needed?
This will bring the latest bug fix for an issue present in `aircompressor` 1.17 ~ 1.20.
- 1e364f7133
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33883 from dongjoon-hyun/SPARK-36629.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ff8cc4b800)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR is to install it from `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.
### Why are the changes needed?
Fix release script and update README of docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR aims to bump ORC to 1.6.10
### Why are the changes needed?
This will bring the latest bug fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33712 from williamhyun/orc.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aff1b5594a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
According to the feedback from GitHub, the change causing the memory issue has been rolled back. We can try to raise the memory again for GA.
### Why are the changes needed?
Trying higher memory settings for GA. It could speed up the testing time.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #33623 from viirya/increasing-mem-ga.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 7d13ac177b)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.
### Why are the changes needed?
To enable the MLflow test in pandas API on Spark.
### Does this PR introduce _any_ user-facing change?
No, it's test-only
### How was this patch tested?
Manually test on local, with `python/run-tests --testnames pyspark.pandas.mlflow`.
Closes #33567 from itholic/SPARK-36254.
Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit abce61f3fd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming compatibility, we can use a newer RocksDB version for the state store implementation.
### Why are the changes needed?
For better ARM support and to leverage the bug fixes in the newer version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33578 from xuanyuanking/SPARK-36347.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4cd5fa96d8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to skip MiMa in the PySpark/SparkR/Docker GHA jobs.
### Why are the changes needed?
This will save GHA resources because MiMa is irrelevant to Python.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GHA.
Closes #33532 from williamhyun/mima.
Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 674202e7b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update to the latest Breeze 1.2.
### Why are the changes needed?
Minor bug fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33449 from srowen/SPARK-35310.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix `lint-python` to pick up `PYTHON_EXECUTABLE` from the environment variable first so the Python version can be switched, and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.
### Why are the changes needed?
Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, the `black` check is not working:
```
The python3 -m black command was not found. Skipping black checks for now.
```
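The shape of the fix, preferring an explicitly provided interpreter and falling back to the old default, can be sketched in Python (the function and variable names are illustrative, not the script's actual ones):

```python
def pick_python(env):
    # Use PYTHON_EXECUTABLE when the caller (e.g. CI) sets it;
    # otherwise fall back to plain "python3" as before.
    return env.get("PYTHON_EXECUTABLE") or "python3"

assert pick_python({"PYTHON_EXECUTABLE": "python3.9"}) == "python3.9"
assert pick_python({}) == "python3"
assert pick_python({"PYTHON_EXECUTABLE": ""}) == "python3"  # empty counts as unset
```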
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The `black` check in `lint-python` should work.
Closes #33507 from ueshin/issues/SPARK-36279/lint-python.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 663cbdfbe5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Trying to adjust build memory settings and serial execution to re-enable GA.
### Why are the changes needed?
GA tests have been failing recently with return code 137. We need to adjust build settings to make GA work.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
GA
Closes #33447 from viirya/test-ga.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fd36ed4550)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to set the lowerbound of mypy version to use in the testing script.
### Why are the changes needed?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141519/console
```
python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Found 13 errors in 4 files (checked 314 source files)
1
```
Jenkins installed mypy at SPARK-32797, but it seems the installed version is not the same as in GitHub Actions.
It seems difficult to make the codebase compatible with multiple mypy versions. Therefore, this PR sets the lowerbound.
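As a sketch of what such a lowerbound check involves (the `mypy --version` output format shown is an assumption, and `mypy_is_new_enough` is an illustrative name, not the script's actual helper):

```python
import re

MINIMUM_MYPY = "0.910"

def mypy_is_new_enough(version_output):
    # version_output is assumed to look like "mypy 0.910".
    m = re.search(r"(\d+)\.(\d+)", version_output)
    if m is None:
        return False
    # Compare as integer tuples, not strings: a string comparison would
    # wrongly rank "0.99" newer than "0.910".
    minimum = tuple(int(p) for p in MINIMUM_MYPY.split("."))
    found = (int(m.group(1)), int(m.group(2)))
    return found >= minimum

assert mypy_is_new_enough("mypy 0.910")
assert not mypy_is_new_enough("mypy 0.812")
```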
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Jenkins job in this PR should test it out.
Also manually tested:
Without mypy:
```
...
flake8 checks passed.
The mypy command was not found. Skipping for now.
```
With mypy 0.812:
```
...
flake8 checks passed.
The minimum mypy version needs to be 0.910. Your current version is mypy 0.812. Skipping for now.
```
With mypy 0.910:
```
...
flake8 checks passed.
starting mypy test...
mypy checks passed.
all lint-python tests passed!
```
Closes #33487 from HyukjinKwon/SPARK-36268.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d6bc8cd681)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade ZSTD-JNI to 1.5.0-4.
### Why are the changes needed?
ZSTD-JNI 1.5.0-3 has a packaging issue. 1.5.0-4 is recommended to be used instead.
- https://github.com/luben/zstd-jni/issues/181#issuecomment-885138495
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33483 from dongjoon-hyun/SPARK-36262.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a1a197403b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR partially backports the fix in the script at https://github.com/apache/spark/pull/33410 to make the branch-3.2 build pass at https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule
### Why are the changes needed?
To make the Scala 2.13 periodical job pass
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
It is a logically non-conflicting backport.
Closes #33472 from HyukjinKwon/SPARK-36251.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`.
`1.5.0-3` was released a few days ago.
This release resolves an issue with buffer size calculation, which can affect usage in Spark.
https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3
### Why are the changes needed?
A skipping length greater than `2^31 - 1` might be a corner case, but it can affect Spark.
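To illustrate the class of bug (a generic demonstration, not zstd-jni's actual code): a length one past `2^31 - 1` wraps to a negative value when forced through a signed 32-bit integer, so any size computed from it afterwards is wrong.

```python
def to_int32(n):
    # Interpret the low 32 bits of n as a signed 32-bit integer,
    # the way a narrowing cast in JVM/native code would.
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

skip_length = 2**31             # one past the signed 32-bit maximum
assert to_int32(skip_length) == -2**31   # wraps negative
assert to_int32(2**31 - 1) == 2**31 - 1  # the maximum itself is fine
```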
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit dcb7db5370)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestssupport/__init__.py`.
### Why are the changes needed?
The following command fails because the definition is missing.
```
SCALA_PROFILE=scala2.12 dev/run-tests.py
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The command shown above works.
Closes #33421 from sarutak/followup-SPARK-36166.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c7ccc602db)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a simple followup from https://github.com/apache/spark/pull/33376:
- It simplifies a bit by removing the default Scala version from the testing script (so we don't have to update it when the default Scala version changes).
- It calls the `change-scala-version.sh` script when `SCALA_PROFILE` is explicitly specified.
### Why are the changes needed?
More refactoring. In addition, this change will be used at https://github.com/apache/spark/pull/33410
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
CI in this PR should test it out.
Closes #33411 from HyukjinKwon/SPARK-36166.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8ee199ef42)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to skip UNIDOC generation in PySpark GHA job.
### Why are the changes needed?
PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total.
- https://github.com/apache/spark/runs/3098268973?check_suite_focus=true
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc
...
[info] Main Java API documentation successful.
[success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GHA.
Closes #33407 from williamhyun/SKIP_UNIDOC.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c336f73ccd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`.
In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we don't maintain the dependency manifests yet. This will be handled after the Apache Spark 3.2.0 release.
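As an illustrative sketch of how such an environment variable can map onto a build profile flag (the function name and exact mapping are assumptions, not the actual `run-tests.py` code):

```python
def scala_profile_flags(env):
    # "scala2.13" adds the corresponding sbt/Maven profile; when unset,
    # the default Scala version's profiles are used as before.
    profile = env.get("SCALA_PROFILE", "")
    if profile == "scala2.13":
        return ["-Pscala-2.13"]
    return []

assert scala_profile_flags({"SCALA_PROFILE": "scala2.13"}) == ["-Pscala-2.13"]
assert scala_profile_flags({}) == []
```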
### Why are the changes needed?
To test Scala 2.13 with `dev/run-tests.py`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual. The following is the result. Note that this PR aims to **run** the Scala 2.13 tests rather than **pass** them. We will have a daily GitHub Actions job via #33358 and will fix UT failures if they exist.
```
$ dev/change-scala-version.sh 2.13
$ SCALA_PROFILE=scala2.13 dev/run-tests.py
...
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl
...
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly
...
[info] Building Spark assembly using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package
...
========================================================================
Running Java style checks
========================================================================
[info] Checking Java style using SBT with these profiles: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc
...
========================================================================
Running Spark unit tests
========================================================================
[info] Running Spark tests using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test
...
```
Closes #33376 from dongjoon-hyun/SPARK-36166.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit f66153de78)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of #33371.
In the GitHub run for a branch commit, the environment variable is set but empty.
This PR adds back the empty string check logic.
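The check boils down to treating a set-but-empty `APACHE_SPARK_REF` the same as an unset one before the value reaches `git diff`; a minimal sketch (the helper name is illustrative):

```python
def resolve_target_ref(env):
    # "" (set but empty) must behave like "not set": passing an empty string
    # to `git diff --name-only HEAD ""` fails with "ambiguous argument ''".
    ref = env.get("APACHE_SPARK_REF")
    return ref if ref else None

assert resolve_target_ref({"APACHE_SPARK_REF": "master"}) == "master"
assert resolve_target_ref({"APACHE_SPARK_REF": ""}) is None
assert resolve_target_ref({}) is None
```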
### Why are the changes needed?
Currently, the failure happens when we use `--modules` in GitHub Actions.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
fatal: ambiguous argument '': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
main()
File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main
changed_files = identify_changed_files_from_git_commits(
File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits
raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target],
File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually. The following failure is expected in a local environment because the run already got past `identify_changed_files_from_git_commits`.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
Traceback (most recent call last):
File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
main()
File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main
os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'GITHUB_SHA'
```
Closes #33374 from dongjoon-hyun/SPARK-36164.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5f41a2752f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to change `run-tests.py` so that it does not fail when `os.environ["APACHE_SPARK_REF"]` is not defined.
### Why are the changes needed?
Currently, `run-tests.py` ends with an error.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33371 from williamhyun/SPARK-36164.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c8a3c22628)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/26330. There is one last place to fix in `dev/test-dependencies.sh`.
### Why are the changes needed?
To stick to Python 3 instead of mistakenly using Python 2.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested.
Closes #33368 from HyukjinKwon/change-python-3.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6bd385f1e3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to disable MiMa check for Scala 2.13 artifacts.
### Why are the changes needed?
Apache Spark doesn't have Scala 2.13 Maven artifacts yet.
SPARK-36151 will enable this after Apache Spark 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual. The following should succeed without real testing.
```
$ dev/mima -Pscala-2.13
```
Closes #33355 from dongjoon-hyun/SPARK-36150.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5acfecbf97)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR upgrades `commons-compress` from `1.20` to `1.21` to deal with CVEs.
### Why are the changes needed?
Some CVEs which affect `commons-compress 1.20` are reported and fixed in `1.21`.
https://commons.apache.org/proper/commons-compress/security-reports.html
* CVE-2021-35515
* CVE-2021-35516
* CVE-2021-35517
* CVE-2021-36090
The severities are reported as low for all of these CVEs, but it would be better to deal with them just in case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #33333 from sarutak/upgrade-commons-compress-1.21.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fd06cc211d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR reverts https://github.com/apache/spark/pull/32455 and its followup https://github.com/apache/spark/pull/32536 , because the new janino version has a bug that is not fixed yet: https://github.com/janino-compiler/janino/pull/148
### Why are the changes needed?
avoid regressions
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #33302 from cloud-fan/revert.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae6199af44)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops
- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which is already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)
### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:
- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
We'd better merge test_decimal_ops into test_num_ops.
See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
unittests passed
Closes #33206 from Yikun/SPARK-36002.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache ORC to 1.6.9.
### Why are the changes needed?
This is required to bring in ORC-804 in order to fix an ORC encryption masking bug.
### Does this PR introduce _any_ user-facing change?
No. This is not released yet.
### How was this patch tested?
Pass the newly added test case.
Closes #33189 from dongjoon-hyun/SPARK-35992.
Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c55b9fd1e0)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
this is the skeleton of the ansible used to configure jenkins workers in the riselab/apache spark build system
### Why are the changes needed?
they are not needed, but will help the community understand how to build systems to test multiple versions of spark, as well as propose changes that i can integrate in to the "production" riselab repo. since we're sunsetting jenkins by EOY 2021, this will potentially be useful for migrating the build system.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ansible-lint and much wailing and gnashing of teeth.
Closes #32178 from shaneknapp/initial-ansible-commit.
Lead-authored-by: shane knapp <incomplete@gmail.com>
Co-authored-by: shane <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
### What changes were proposed in this pull request?
This PR aims to clean up Spark 2.4 and Java7 code path from the release scripts.
### Why are the changes needed?
To simplify the logic.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #33150 from dongjoon-hyun/SPARK-35948.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Like SPARK-35825, this PR aims to increase JVM stack size via `MAVEN_OPTS` in release-build.sh.
### Why are the changes needed?
This will mitigate the failure in publishing snapshot GitHub Action job and during the release.
- https://github.com/apache/spark/actions/workflows/publish_snapshot.yml (3-day consecutive failures)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #33149 from dongjoon-hyun/SPARK-35947.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The implementation of the RocksDB instance, which is used in the RocksDB state store. It acts as a handler for the RocksDB instance and RocksDBFileManager.
### Why are the changes needed?
Part of the RocksDB state store implementation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT added.
Closes #32928 from xuanyuanking/SPARK-35784.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 9.1
### Why are the changes needed?
The latest `xbean-asm9-shaded` is built with ASM 9.1.
- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#33130 from dongjoon-hyun/SPARK-35928.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add path level discover for python unittests.
### Why are the changes needed?
Currently we have to register every Python test case manually when adding a new one. Sometimes we forget to add the test case to the module list, and then it is never executed.
Such as:
- pyspark-core pyspark.tests.test_pin_thread
Thus we need an auto-discovery mechanism that finds all test cases rather than listing every case manually.
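The auto-discovery described above can be sketched as a simple path walk that turns every `test_*.py` file into a dotted test goal. This is a hypothetical helper for illustration, not the actual `modules.py` change:

```python
from pathlib import Path


def discover_test_goals(root):
    """Walk `root` and convert every test_*.py file into a dotted test goal.

    Hypothetical sketch of path-based discovery; the real logic in
    dev/sparktestsupport/modules.py may differ.
    """
    root = Path(root)
    return sorted(
        p.relative_to(root).with_suffix("").as_posix().replace("/", ".")
        for p in root.rglob("test_*.py")
    )
```

With such a helper, a newly added `pyspark/tests/test_pin_thread.py` would show up as the goal `pyspark.tests.test_pin_thread` without any manual registration.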
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL
Closes#32867 from Yikun/SPARK_DISCOVER_TEST.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR updates the doctests in `run-tests.py`.
### Why are the changes needed?
This should be consistent with the `modules.py` behavior.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the GitHub Action.
I checked manually.
```
$ python dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
[info] Found the following changed modules: root
[info] Setup the following environment variables for tests:
========================================================================
Running Apache RAT checks
========================================================================
RAT checks passed.
```
Closes#33127 from dongjoon-hyun/SPARK-35483-2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally.
### Why are the changes needed?
Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test.
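The trigger logic described above can be sketched as a dependency lookup: a test module lists the source modules whose changes should trigger it. The module names come from the PR description; the function itself is hypothetical, not the actual `run-tests` code:

```python
# Hypothetical sketch: map a test module to the source modules whose
# changes should trigger it.
TRIGGERS = {
    "docker_integration_tests": {"catalyst", "sql", "docker_integration_tests"},
}


def should_run(test_module, changed_modules):
    """Return True when any changed module is registered as a trigger."""
    return bool(TRIGGERS.get(test_module, set()) & set(changed_modules))
```

Before this PR, only a change to `docker_integration_tests` itself would fire the JDBC integration test; adding `catalyst` and `sql` to the trigger set closes that gap.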
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#33125 from dongjoon-hyun/SPARK-35483.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Chill to 0.10.0.
### Why are the changes needed?
This is a maintenance release with cross-compilation for Scala 2.12.14 and 2.13.6.
- https://github.com/twitter/chill/releases/tag/v0.10.0
### Does this PR introduce _any_ user-facing change?
No, this is a dependency change.
### How was this patch tested?
Pass the CIs.
Closes#33119 from dongjoon-hyun/SPARK-35920.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to support creating a Column out of a numpy literal value in pandas-on-Spark. It mainly consists of three changes:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.
```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` with `SF.lit`, that is, use the `lit` function defined in `pyspark.pandas.spark.functions` rather than the one defined in `pyspark.sql.functions`, to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method
Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, so numpy literals are disallowed as input (e.g. the `to_replace` parameter in the `replace` API). This PR doesn't aim to adjust all of them; it adjusts `isin` only, because that is what inspired the PR (see https://github.com/databricks/koalas/issues/2161).
- Completing the mappings between all kinds of numpy literals and Spark data types should be a follow-up task.
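The dtype-to-SQL-type casts shown in the doctest above can be sketched as a lookup table. This is a hypothetical helper for illustration only; the real `SF.lit` implementation may resolve the types differently:

```python
import numpy as np

# Hypothetical mapping from numpy scalar types to Spark SQL type names,
# mirroring the CAST targets in the doctest above.
NUMPY_TO_SPARK_SQL = {
    np.int8: "tinyint",    # np.byte is an alias for np.int8
    np.int16: "smallint",
    np.int32: "int",
    np.int64: "bigint",
    np.float32: "float",
    np.float64: "double",
}


def spark_sql_type(value):
    """Return the Spark SQL type name a numpy scalar should be cast to."""
    for np_type, sql_type in NUMPY_TO_SPARK_SQL.items():
        if isinstance(value, np_type):
            return sql_type
    raise TypeError(f"Unsupported numpy literal type: {type(value)}")
```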
### Why are the changes needed?
Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```
After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0 True
1 True
2 False
3 False
4 False
Name: source, dtype: bool
```
### How was this patch tested?
Unit tests.
Closes#32955 from xinrong-databricks/datatypeops_literal.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Update Ivy from 2.4.0 to 2.5.0.
- https://ant.apache.org/ivy/history/2.5.0/release-notes.html
### Why are the changes needed?
This brings various improvements and bug fixes. Most notably, the new `ivy.maven.lookup.sources` and `ivy.maven.lookup.javadoc` configs can significantly speed up module resolution time when turned off, especially behind a proxy. These could arguably be turned off by default, because when submitting jobs you probably don't care about the sources or javadoc jars. I didn't include that here, but I'm happy to look into it if desired.
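The two lookup settings named above are Ivy properties introduced in 2.5.0; a sketch of the properties to set (how they are wired into a given build or `ivysettings.xml` is left open here):

```
# Skip resolving sources and javadoc jars to speed up dependency resolution
ivy.maven.lookup.sources=false
ivy.maven.lookup.javadoc=false
```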
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT and build passes
Closes#33088 from Kimahriman/feature/ivy-update.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
After the RC vote passes, the release manager still needs to do a lot of work to finalize the release. This PR updates the script to automate some steps:
1. create the final git tag
2. publish to pypi
3. publish docs to spark-website
4. move the release binaries from the dev directory to the release directory.
5. update the KEYS file
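The five steps above can be sketched as a dry-run command plan. All names, paths, and versions are illustrative assumptions, not the actual `release-build.sh` commands:

```python
def finalize_release_plan(version, rc_tag):
    """Return the commands a release manager would run to finalize a
    release. Hypothetical sketch; paths and tools are assumptions."""
    dist = "https://dist.apache.org/repos/dist"
    return [
        # 1. create the final git tag from the winning RC tag
        f"git tag v{version} {rc_tag}",
        # 2. publish the Python package to PyPI
        f"twine upload dist/pyspark-{version}.tar.gz",
        # 3. publish the generated docs to spark-website
        f"publish docs for {version} to spark-website",
        # 4. move the release binaries from the dev to the release directory
        f"svn mv {dist}/dev/spark/{rc_tag}-bin {dist}/release/spark/spark-{version}",
        # 5. update the KEYS file
        f"svn ci {dist}/release/spark/KEYS",
    ]
```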
### Why are the changes needed?
To ease the work of the release manager.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tested with the recent 3.0.3.
Closes#33055 from cloud-fan/release.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add `python-is-python3` to `create-release/spark-rm/Dockerfile`
### Why are the changes needed?
Systems that use python3 by default should explicitly indicate that the Python version is 3.
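The change amounts to adding one Debian/Ubuntu package to the image; a hypothetical fragment of `create-release/spark-rm/Dockerfile` (the surrounding package list is an assumption, only `python-is-python3` comes from this PR):

```dockerfile
# python-is-python3 makes the `python` command resolve to python3,
# so release scripts that invoke `python` pick up Python 3.
RUN apt-get update && \
    apt-get install -y python3 python3-pip python-is-python3
```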
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested during Apache 3.0.3 release.
Closes#33048 from Ngone51/fix-release-script.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to add `hadoop-cloud` profile to `PUBLISH_PROFILES` in order to publish `hadoop-cloud` module.
Note that this doesn't change `BASE_RELEASE_PROFILES` and there is no change in the binary distributions.
### Why are the changes needed?
This is discussed here.
- https://lists.apache.org/thread.html/rf87d755460d5ed85c7b6ac0edad48f53c929a2cd287f30be24afd2ad%40%3Cuser.spark.apache.org%3E
### Does this PR introduce _any_ user-facing change?
Yes, this will provide `hadoop-cloud` module in Maven Central.
### How was this patch tested?
N/A (After merging this, we can check the daily snapshot result)
Closes#33003 from dongjoon-hyun/SPARK-35844.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>