Commit graph

1095 commits

Author SHA1 Message Date
Gengliang Wang f47a519721 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR is to install it from `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.

### Why are the changes needed?

Fix release script and update README of docs

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Manual test locally.

Closes #33797 from gengliangwang/fixReleaseDocker.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 20:02:44 +08:00
Sean Owen b8c1014e23 Update Spark key negotiation protocol 2021-08-14 09:08:29 -05:00
William Hyun 1a371fbfa1 [SPARK-36482][BUILD] Bump orc to 1.6.10
### What changes were proposed in this pull request?
This PR aims to bump ORC to 1.6.10

### Why are the changes needed?
This will bring the latest bug fixes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33712 from williamhyun/orc.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aff1b5594a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-11 11:32:18 -07:00
Liang-Chi Hsieh 712c311736 [SPARK-36393][BUILD] Try to raise memory for GHA
### What changes were proposed in this pull request?

According to the feedback from GitHub, the change causing memory issue has been rolled back. We can try to raise memory again for GA.

### Why are the changes needed?

Trying higher memory settings for GA. It could speed up the testing time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #33623 from viirya/increasing-mem-ga.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 7d13ac177b)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-05 01:31:45 -07:00
Hyukjin Kwon 310cd8eef1 [SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins
This PR proposes to migrate Coverage report from Jenkins to GitHub Actions by setting a dailly cron job.

For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/

Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around by manually hosting a coverage site via GitHub pages, see also https://spark-test.github.io/pyspark-coverage-site/ by spark-test account (which is shared to only subset of PMC members).

Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used.

Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same.

I manually tested:
- Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484
- Coverage report: 73f0291a7d/python/pyspark
- Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175

Closes #33591 from HyukjinKwon/SPARK-36092.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c0d1860f25)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:38:39 +09:00
itholic a9c5b1a5c8 [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding a Python package, `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually test on local, with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit abce61f3fd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:59 -07:00
Yuanjian Li e8462a584c [SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3
### What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.

### Why are the changes needed?
For further ARM support and leverage the bug fix for the newer version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33578 from xuanyuanking/SPARK-36347.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4cd5fa96d8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-07-29 11:09:10 -07:00
William Hyun dfa5c4dadc [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.

This will save GHA resource because MiMa is irrelevant to Python.

No.
Pass the GHA.

Closes #33532 from williamhyun/mima.

Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 674202e7b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 16:50:10 +09:00
Sean Owen c7d246ba4e [SPARK-35310][MLLIB] Update to breeze 1.2
Update to the latest breeze 1.2

Minor bug fixes

No.

Existing tests

Closes #33449 from srowen/SPARK-35310.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-24 08:20:25 -05:00
Takuya UESHIN c1434b1928 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 663cbdfbe5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:51 +09:00
Liang-Chi Hsieh a6418a3463 [SPARK-36270][BUILD] Change memory settings for enabling GA
### What changes were proposed in this pull request?

Trying to adjust build memory settings and serial execution to re-enable GA.

### Why are the changes needed?

GA tests are failed recently due to return code 137. We need to adjust build settings to make GA work.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

GA

Closes #33447 from viirya/test-ga.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fd36ed4550)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:11:09 +09:00
Hyukjin Kwon f169f056b4 [SPARK-36268][PYTHON] Set the lowerbound of mypy version to 0.910
### What changes were proposed in this pull request?

This PR proposes to set the lowerbound of mypy version to use in the testing script.

### Why are the changes needed?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141519/console

```
python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Found 13 errors in 4 files (checked 314 source files)
1
```

Jenkins installed mypy at SPARK-32797 but seems the version installed is not same as GIthub Actions.

It seems difficult to make the codebase compatible with multiple mypy versions. Therefore, this PR sets the lowerbound.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins job in this PR should test it out.

Also manually tested:

Without mypy:

```
...
flake8 checks passed.

The mypy command was not found. Skipping for now.
```

With mypy 0.812:

```
...
flake8 checks passed.

The minimum mypy version needs to be 0.910. Your current version is mypy 0.812. Skipping for now.
```

With mypy 0.910:

```
...
flake8 checks passed.

starting mypy test...
mypy checks passed.

all lint-python tests passed!
```

Closes #33487 from HyukjinKwon/SPARK-36268.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d6bc8cd681)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:28:28 +09:00
Dongjoon Hyun 60566f9d8e [SPARK-36262][BUILD] Upgrade ZSTD-JNI to 1.5.0-4
### What changes were proposed in this pull request?

This PR aims to upgrade ZSTD-JNI to 1.5.0-4.

### Why are the changes needed?

ZSTD-JNI 1.5.0-3 has a packaging issue. 1.5.0-4 is recommended to be used instead.
- https://github.com/luben/zstd-jni/issues/181#issuecomment-885138495

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33483 from dongjoon-hyun/SPARK-36262.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a1a197403b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-22 14:04:14 -07:00
Hyukjin Kwon d01e53208b [SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script
### What changes were proposed in this pull request?

This PR partially backports the fix in the script at https://github.com/apache/spark/pull/33410 to make the branch-3.2 build pass at https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule

### Why are the changes needed?

To make the Scala 2.13 periodical job pass

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It is a logically non-conflicting backport.

Closes #33472 from HyukjinKwon/SPARK-36251.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 11:47:36 +09:00
Kousuke Saruta fef7bf9fcc [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
### What changes were proposed in this pull request?

This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`.
`1.5.0-3` was released few days ago.
This release resolves an issue about buffer size calculation, which can affect usage in Spark.
https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3

### Why are the changes needed?

It might be a corner case that skipping length is greater than `2^31 - 1` but it's possible to affect Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit dcb7db5370)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-21 19:37:18 -07:00
Kousuke Saruta 57794d3ec9 [SPARK-36166][TESTS][FOLLOWUP] Add BLOCK_SCALA_VERSION to sparktestssupport/__init__.py
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestssupport/__init__.py`.

### Why are the changes needed?

The following command fails due to the definition is missing.
```
SCALA_PROFILE=scala2.12 dev/run-tests.py
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The command shown above works.

Closes #33421 from sarutak/followup-SPARK-36166.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c7ccc602db)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 22:47:14 +09:00
Hyukjin Kwon 2ae77574dc [SPARK-36166][TESTS][FOLLOW-UP] Add Scala version change logic into testing script
### What changes were proposed in this pull request?

This PR is a simple followup from https://github.com/apache/spark/pull/33376:
- It simplifies a bit by removing the default Scala version in the testing script (so we don't have to change here in the future when we change the Scala default version).
- Call `change-scala-version.sh` script (when `SCALA_PROFILE` is explicitly specified)

### Why are the changes needed?

More refactoring. In addition, this change will be used at https://github.com/apache/spark/pull/33410

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #33411 from HyukjinKwon/SPARK-36166.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8ee199ef42)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 18:01:14 +09:00
William Hyun d5cec45c0b [SPARK-36198][TESTS] Skip UNIDOC generation in PySpark GHA job
### What changes were proposed in this pull request?
This PR aims to skip UNIDOC generation in PySpark GHA job.

### Why are the changes needed?

PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total.
-https://github.com/apache/spark/runs/3098268973?check_suite_focus=true
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc
...
[info] Main Java API documentation successful.
[success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the GHA.

Closes #33407 from williamhyun/SKIP_UNIDOC.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c336f73ccd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 17:52:40 -07:00
Dongjoon Hyun 12c8c89693 [SPARK-36166][TESTS] Support Scala 2.13 test in dev/run-tests.py
### What changes were proposed in this pull request?

For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`.

In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we currently don't maintain the dependency manifests yet. This will be handled after Apache Spark 3.2.0 release.

### Why are the changes needed?

To test Scala 2.13 with `dev/run-tests.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following is the result. Note that this PR aims to **run** Scala 2.13 tests instead of **passing** them. We will have daily GitHub Action job via #33358 and will fix UT failures if exists.
```
$ dev/change-scala-version.sh 2.13

$ SCALA_PROFILE=scala2.13 dev/run-tests.py
...
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl
...
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly
...

[info] Building Spark assembly using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package
...

========================================================================
Running Java style checks
========================================================================
[info] Checking Java style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud
...

========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc
...

========================================================================
Running Spark unit tests
========================================================================
[info] Running Spark tests using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test
...
```

Closes #33376 from dongjoon-hyun/SPARK-36166.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit f66153de78)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 19:26:20 -07:00
Dongjoon Hyun 0520497d60 [SPARK-36164][INFRA][FOLLOWUP] Add empty string check back
### What changes were proposed in this pull request?

This is a follow-up of #33371.
At the branch commit GitHub run, we have an empty environment variable.
This PR adds back the empty string check logic.

### Why are the changes needed?

Currently, the failure happens when we use `--modules` in GitHub Action.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
fatal: ambiguous argument '': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main
    changed_files = identify_changed_files_from_git_commits(
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits
    raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target],
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually. The following failure is correct in local environment because it passed `identify_changed_files_from_git_commits` already.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main
    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'GITHUB_SHA'
```

Closes #33374 from dongjoon-hyun/SPARK-36164.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5f41a2752f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 13:44:27 -07:00
William Hyun ebc830f14e [SPARK-36164][INFRA] run-test.py should not fail when APACHE_SPARK_REF is not defined
### What changes were proposed in this pull request?
This PR aims to change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

### Why are the changes needed?
Currently, the run-test.py ends with an error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33371 from williamhyun/SPARK-36164.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c8a3c22628)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 11:43:48 -07:00
Hyukjin Kwon a87a6df2d1 [SPARK-36159][BUILD] Replace 'python' to 'python3' in dev/test-dependencies.sh
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/26330. There is the last place to fix in `dev/test-dependencies.sh`

### Why are the changes needed?

To stick to Python 3 instead of using Python 2 mistakenly.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #33368 from HyukjinKwon/change-python-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6bd385f1e3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 07:58:27 -07:00
Dongjoon Hyun 384bee3663 [SPARK-36150][INFRA][TESTS] Disable MiMa for Scala 2.13 artifacts
### What changes were proposed in this pull request?

This PR aims to disable MiMa check for Scala 2.13 artifacts.

### Why are the changes needed?

Apache Spark doesn't have Scala 2.13 Maven artifacts yet.
SPARK-36151 will enable this after Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following should succeed without real testing.
```
$ dev/mima -Pscala-2.13
```

Closes #33355 from dongjoon-hyun/SPARK-36150.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5acfecbf97)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 00:34:43 -07:00
Kousuke Saruta ca8d2670b7 [SPARK-36129][BUILD] Upgrade commons-compress to 1.21 to deal with CVEs
### What changes were proposed in this pull request?

This PR upgrades `commons-compress` from `1.20` to `1.21` to deal with CVEs.

### Why are the changes needed?

Some CVEs which affect `commons-compress 1.20` are reported and fixed in `1.21`.
https://commons.apache.org/proper/commons-compress/security-reports.html

* CVE-2021-35515
* CVE-2021-35516
* CVE-2021-35517
* CVE-2021-36090

The severities are reported as low for all the CVEs but it would be better to deal with them just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33333 from sarutak/upgrade-commons-compress-1.21.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fd06cc211d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-13 22:53:22 -07:00
Wenchen Fan c1d8ccfb64 Revert "[SPARK-35253][SPARK-35398][SQL][BUILD] Bump up the janino version to v3.1.4"
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/32455 and its followup https://github.com/apache/spark/pull/32536 , because the new janino version has a bug that is not fixed yet: https://github.com/janino-compiler/janino/pull/148

### Why are the changes needed?

avoid regressions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33302 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae6199af44)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 12:14:21 +09:00
Yikun Jiang fd277dc036 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unittests passed

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:23 +09:00
Dongjoon Hyun d7990943c3 [SPARK-35992][BUILD] Upgrade ORC to 1.6.9
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.9.

### Why are the changes needed?

This is required to bring ORC-804 in order to fix ORC encryption masking bug.

### Does this PR introduce _any_ user-facing change?

No. This is not released yet.

### How was this patch tested?

Pass the newly added test case.

Closes #33189 from dongjoon-hyun/SPARK-35992.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c55b9fd1e0)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 09:50:00 -07:00
shane knapp 2c94fbc71e initial commit for skeleton ansible for jenkins worker config
### What changes were proposed in this pull request?
this is the skeleton of the ansible used to configure jenkins workers in the riselab/apache spark build system

### Why are the changes needed?
they are not needed, but will help the community understand how to build systems to test multiple versions of spark, as well as propose changes that i can integrate in to the "production" riselab repo.  since we're sunsetting jenkins by EOY 2021, this will potentially be useful for migrating the build system.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ansible-lint and much wailing and gnashing of teeth.

Closes #32178 from shaneknapp/initial-ansible-commit.

Lead-authored-by: shane knapp <incomplete@gmail.com>
Co-authored-by: shane <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2021-06-30 10:05:27 -07:00
Dongjoon Hyun b218cc90cf [SPARK-35948][INFRA] Simplify release scripts by removing Spark 2.4/Java7 parts
### What changes were proposed in this pull request?

This PR aims to clean up Spark 2.4 and Java7 code path from the release scripts.

### Why are the changes needed?

To simplify the logic.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33150 from dongjoon-hyun/SPARK-35948.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 16:24:03 +09:00
Dongjoon Hyun 5312008cca [SPARK-35947][INFRA] Increase JVM stack size in release-build.sh
### What changes were proposed in this pull request?

Like SPARK-35825, this PR aims to increase JVM stack size via `MAVEN_OPTS` in release-build.sh.

### Why are the changes needed?

This will mitigate the failure in publishing snapshot GitHub Action job and during the release.

- https://github.com/apache/spark/actions/workflows/publish_snapshot.yml (3-day consecutive failures)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33149 from dongjoon-hyun/SPARK-35947.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 16:23:13 +09:00
Yuanjian Li 3257a30e53 [SPARK-35784][SS] Implementation for RocksDB instance
### What changes were proposed in this pull request?
The implementation for the RocksDB instance, which is used in the RocksDB state store. It plays a role as a handler for the RocksDB instance and RocksDBFileManager.

### Why are the changes needed?
Part of the RocksDB state store implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #32928 from xuanyuanking/SPARK-35784.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-29 17:46:45 -07:00
Takuya UESHIN 1f6e2f55d7 Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"
This reverts commit 5db51efa1a.
2021-06-29 12:08:09 -07:00
Dongjoon Hyun 7e7028282c [SPARK-35928][BUILD] Upgrade ASM to 9.1
### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 9.1

### Why are the changes needed?

The latest `xbean-asm9-shaded` is built with ASM 9.1.

- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33130 from dongjoon-hyun/SPARK-35928.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 10:27:51 -07:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need some auto-discover way to find all testcase rather than specified every case by manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Dongjoon Hyun 0a7a6f750c [SPARK-35483][FOLLOWUP][TESTS] Update run-tests.py doctest
### What changes were proposed in this pull request?

This PR updates the doctests in `run-tests.py`.

### Why are the changes needed?

This should be consists with `modules.py` behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the GitHub Action.

I checked manually.
```
$ python dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
[info] Found the following changed modules: root
[info] Setup the following environment variables for tests:

========================================================================
Running Apache RAT checks
========================================================================
RAT checks passed.
```

Closes #33127 from dongjoon-hyun/SPARK-35483-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 23:14:47 -07:00
Dongjoon Hyun 57896e662e [SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for catalyst/sql module changes too
### What changes were proposed in this pull request?

This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally.

### Why are the changes needed?

Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33125 from dongjoon-hyun/SPARK-35483.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:59:56 -07:00
Dongjoon Hyun b999e6bd90 [SPARK-35920][BUILD] Upgrade to Chill 0.10.0
### What changes were proposed in this pull request?

This PR aims to upgrade Chill to 0.10.0.

### Why are the changes needed?

This is a maintenance release having cross-compilation to 2.12.14 and 2.13.6 .
- https://github.com/twitter/chill/releases/tag/v0.10.0

### Does this PR introduce _any_ user-facing change?

No, this is a dependency change.

### How was this patch tested?

Pass the CIs.

Closes #33119 from dongjoon-hyun/SPARK-35920.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:06:41 -07:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions`  is adjusted in order to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Adam Binford 939ea3d5da [SPARK-35863][BUILD] Update Ivy to 2.5.0
### What changes were proposed in this pull request?

Update Ivy from 2.4.0 to 2.5.0.

- https://ant.apache.org/ivy/history/2.5.0/release-notes.html

### Why are the changes needed?

This brings various improvements and bug fixes. Most notably, the adding of `ivy.maven.lookup.sources` and `ivy.maven.lookup.javadoc` configs can significantly speed up module resolution time if these are turned off, especially behind a proxy. These could arguably be turned off by default, because when submitting jobs you probably don't care about the sources or javadoc jars. I didn't include that here but happy to look into if it's desired.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT and build passes

Closes #33088 from Kimahriman/feature/ivy-update.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 07:37:36 -07:00
Wenchen Fan 95ba000279 [SPARK-35872][INFRA] Automatize some steps to finalize the release
### What changes were proposed in this pull request?

After the RC vote, the release manager still need to do many work to finalize the release. This PR updates the script the automatize some steps:
1. create the final git tag
2. publish to pypi
3. publish docs to spark-website
4. move the release binaries from dev directory to release directory.
5. update the KEYS file

### Why are the changes needed?

easy the work of release manager.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested with the recent 3.0.3.

Closes #33055 from cloud-fan/release.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-24 13:25:41 -07:00
yi.wu 1cdc56c70d [SPARK-35869][INFRA] Fix "Cannot run program python" error from do-release-docker.sh
### What changes were proposed in this pull request?

Add `python-is-python3` to `create-release/spark-rm/Dockerfile`

### Why are the changes needed?

Systems that use pthon3 by default should explicitly indicate the python version is 3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested during Apache 3.0.3 release.

Closes #33048 from Ngone51/fix-release-script.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 12:47:28 +09:00
Dongjoon Hyun 0f25cabbc2 [SPARK-35844][INFRA] Add hadoop-cloud profile to PUBLISH_PROFILES
### What changes were proposed in this pull request?

This PR aims to add `hadoop-cloud` profile to `PUBLISH_PROFILES` in order to publish `hadoop-cloud` module.

Note that this doesn't change `BASE_RELEASE_PROFILES` and there is no change in the binary distributions.

### Why are the changes needed?

This is discussed here.
- https://lists.apache.org/thread.html/rf87d755460d5ed85c7b6ac0edad48f53c929a2cd287f30be24afd2ad%40%3Cuser.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

Yes, this will provide `hadoop-cloud` module in Maven Central.

### How was this patch tested?

N/A (After merging this, we can check the daily snapshot result)

Closes #33003 from dongjoon-hyun/SPARK-35844.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-21 11:56:21 -07:00
Yikun Jiang b7df75a777 [SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps
### What changes were proposed in this pull request?
This patch adds DataTypeOps test to check the ops is loaded as expected.

### Why are the changes needed?
When complete https://github.com/apache/spark/pull/32821, I found there are no test for DataTypeOps. There were many logic when DataTypeOps loaded, it's better to add the test to make sure interface stable.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
test passed.

Closes #32859 from Yikun/SPARK-XXXXX1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 18:54:50 -07:00
Yikun Jiang f84a720fe3 [SPARK-35342][PYTHON] Introduce DecimalOps and make isnull method data-type-based
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based

### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN.

https://issues.apache.org/jira/browse/SPARK-35342

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- New added DecimalOpsTest passed
- Existing NumOpsTest passed

Closes #32821 from Yikun/SPARK-35342.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 10:44:35 -07:00
David Christle 7fcb127674 [SPARK-35670][BUILD] Upgrade ZSTD-JNI to 1.5.0-2
### What changes were proposed in this pull request?
This PR aims to upgrade `zstd-jni` to 1.5.0-2, which uses `zstd` version 1.5.0.

### Why are the changes needed?
Major improvements to Zstd support are targeted for the upcoming 3.2.0 release of Spark. Zstd 1.5.0 introduces significant compression (+25% to 140%) and decompression (~15%) speed improvements in benchmarks described in more detail on the releases page:

- https://github.com/facebook/zstd/releases/tag/v1.5.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Build passes build tests, but the benchmark tests seem flaky. I am unsure if this change is responsible. The error is:
```
Running org.apache.spark.rdd.CoalescedRDDBenchmark:
21/06/08 18:53:10 ERROR SparkContext: Failed to add file:/home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar to Spark environment
java.lang.IllegalArgumentException: requirement failed: File spark-core_2.12-3.2.0-SNAPSHOT-tests.jar was already registered with a different path (old path = /home/runner/work/spark/spark/core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar, new path = /home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar
```

https://github.com/dchristle/spark/runs/2776123749?check_suite_focus=true

cc: dongjoon-hyun

Closes #32826 from dchristle/ZSTD150.

Lead-authored-by: David Christle <dchristle@squareup.com>
Co-authored-by: David Christle <dchristle@users.noreply.github.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-17 11:06:50 -07:00
Chao Sun 506ef9aad7 [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1
### What changes were proposed in this pull request?

This upgrade default Hadoop version from 3.2.1 to 3.3.1. The changes here are simply update the version number and dependency file.

### Why are the changes needed?

Hadoop 3.3.1 just came out, which comes with many client-side improvements such as for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Existing unit tests in Spark
- Manually tested using my S3 bucket for event log dir:
```
bin/spark-shell \
  -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  -c spark.eventLog.enabled=true
  -c spark.eventLog.dir=s3a://<my-bucket>
```
- Manually tested against docker-based YARN dev cluster, by running `SparkPi`.

Closes #30135 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-16 13:28:07 -07:00
Sumeet Gajjar 864ff67746 [SPARK-35429][CORE] Remove commons-httpclient from Hadoop-3.2 profile due to EOL and CVEs
### What changes were proposed in this pull request?

Remove commons-httpclient as a direct dependency for Hadoop-3.2 profile.
Hadoop-2.7 profile distribution still has it, hadoop-client has a compile dependency on commons-httpclient, thus we cannot remove it for Hadoop-2.7 profile.
```
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.7.4:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.7.4:compile
[INFO] |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
```

### Why are the changes needed?

Spark is pulling in commons-httpclient as a dependency directly. commons-httpclient went EOL years ago and there are most likely CVEs not being reported against it, thus we should remove it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Existing unittests
- Checked the dependency tree before and after introducing the changes

Before:
```
./build/mvn dependency:tree -Phadoop-3.2 | grep -i "commons-httpclient"
Using `mvn` from path: /usr/bin/mvn
[INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:provided
```

After
```
./build/mvn dependency:tree | grep -i "commons-httpclient"
Using `mvn` from path: /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn
```

P.S. Reopening this since [spark upgraded](463daabd5a) its `hive.version` to `2.3.9` which does not have a dependency on `commons-httpclient`.

Closes #32912 from sumeetgajjar/SPARK-35429.

Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-15 14:43:30 -07:00
Yuming Wang 463daabd5a [SPARK-34512][BUILD][SQL] Upgrade built-in Hive to 2.3.9
### What changes were proposed in this pull request?

This pr upgrades built-in Hive to 2.3.9. Hive 2.3.9 changes:
- [HIVE-17155] - findConfFile() in HiveConf.java has some issues with the conf path
- [HIVE-24797] - Disable validate default values when parsing Avro schemas
- [HIVE-24608] - Switch back to get_table in HMS client for Hive 2.3.x
- [HIVE-21200] - Vectorization: date column throwing java.lang.UnsupportedOperationException for parquet
- [HIVE-21563] - Improve Table#getEmptyTable performance by disabling registerAllFunctionsOnce
- [HIVE-19228] - Remove commons-httpclient 3.x usage

### Why are the changes needed?

Fix regression caused by AVRO-2035.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32750 from wangyum/SPARK-34512.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 20:44:35 -07:00
Xinrong Meng 04a8d2cbcf [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes
### What changes were proposed in this pull request?

Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.

### Why are the changes needed?

The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 13:12:12 -07:00
Dongjoon Hyun 6f2ffccb5e [SPARK-35660][BUILD][K8S] Upgrade kubernetes-client to 5.4.1
### What changes were proposed in this pull request?

This PR aims to upgrade kubernetes-client to 5.4.1.

### Why are the changes needed?

This will bring a few bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32798 from dongjoon-hyun/SPARK-35660.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-06 22:27:08 -07:00