Commit graph

1119 commits

Author SHA1 Message Date
Dongjoon Hyun 3b560390c0 [SPARK-36805][BUILD][K8S] Upgrade kubernetes-client to 5.7.3
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` from 5.6.0 to 5.7.3.

### Why are the changes needed?

This will bring the latest improvements and bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.7.3
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.7.2
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.7.1
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.7.0

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs.

Closes #34047 from dongjoon-hyun/SPARK-36805.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-20 09:14:48 -07:00
Dongjoon Hyun a396dd6216 [SPARK-34112][BUILD] Upgrade ORC to 1.7.0
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC from 1.6.11 to 1.7.0 for Apache Spark 3.3.0.

### Why are the changes needed?

[Apache ORC 1.7.0](https://orc.apache.org/news/2021/09/15/ORC-1.7.0/) is a new release with the following new features and improvements.
  - ORC-377 Support Snappy compression in C++ Writer
  - ORC-577 Support row-level filtering
  - ORC-716 Build and test on Java 17-EA
  - ORC-731 Improve Java Tools
  - ORC-742 LazyIO of non-filter columns
  - ORC-751 Implement Predicate Pushdown in C++ Reader
  - ORC-755 Introduce OrcFilterContext
  - ORC-757 Add Hashtable implementation for dictionary
  - ORC-780 Support LZ4 Compression in C++ Writer
  - ORC-797 Allow writers to get the stripe information
  - ORC-818 Build and test in Apple Silicon
  - ORC-861 Bump CMake minimum requirement to 2.8.12
  - ORC-867 Upgrade hive-storage-api to 2.8.1
  - ORC-984 Save the software version that wrote each ORC file

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing CIs because this is a dependency change.

Closes #34045 from dongjoon-hyun/SPARK-34112.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-20 01:09:15 -07:00
Yang He 5d0889bf36 [SPARK-36780][BUILD] Make dev/mima run on Java 17
### What changes were proposed in this pull request?

Java 17 has been officially released. This PR makes `dev/mima` run on Java 17.

### Why are the changes needed?

To make tests pass on Java 17.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #34022 from RabbidHY/SPARK-36780.

Lead-authored-by: Yang He <stitch106hy@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-17 08:54:49 -07:00
Dongjoon Hyun c217797297 [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-15 23:36:26 -07:00
Dongjoon Hyun 16f1f71ba5 [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
### What changes were proposed in this pull request?

This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.

### Why are the changes needed?

Scala 2.12.15 improves compatibility with JDK 17 and 18:

https://github.com/scala/scala/releases/tag/v2.12.15

- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer

### Does this PR introduce _any_ user-facing change?

Yes, this is a Scala version change.

### How was this patch tested?

Pass the CIs

Closes #33999 from dongjoon-hyun/SPARK-36759.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-15 13:43:25 -07:00
Chao Sun a927b0836b [SPARK-36726] Upgrade Parquet to 1.12.1
### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-15 19:17:34 +00:00
Kevin Su 3e5d3d1cfe [SPARK-34943][BUILD] Upgrade flake8 to 3.8.0 or above in Jenkins
### What changes were proposed in this pull request?

Upgrade flake8 to 3.8.0 or above in Jenkins

### Why are the changes needed?

In flake8 < 3.8.0, an F401 (unused import) error is reported for imports inside `if TYPE_CHECKING:` blocks. However, TYPE_CHECKING is always False at runtime, so there is no need to treat such imports as an error in static analysis.

Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors occur for several lines in pandas-on-Spark code that use TYPE_CHECKING, as the sketch below illustrates.
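For context, the pattern that triggers the false positive looks like the following hedged sketch (illustrative only, not actual pandas-on-Spark code). The import under `if TYPE_CHECKING:` exists purely for type hints and never runs at runtime, so reporting it as unused (F401) is a false alarm that flake8 >= 3.8.0 no longer raises.

```
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed by the type checker; TYPE_CHECKING is always False at
    # runtime, so this import never executes. flake8 < 3.8.0 still flagged
    # it as an unused import (F401), while >= 3.8.0 understands the pattern.
    from pyspark.sql import DataFrame


def head_rows(df: "DataFrame", n: int = 5) -> "DataFrame":
    # String annotations avoid needing the import at runtime.
    return df.limit(n)
```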

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CI

Closes #32749 from pingsutw/SPARK-34943.

Lead-authored-by: Kevin Su <pingsutw@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 09:24:50 +09:00
Dongjoon Hyun d730ef24fe [SPARK-36712][BUILD][FOLLOWUP] Improve the regex to avoid breaking pom.xml
### What changes were proposed in this pull request?

This PR aims to fix the regex to avoid breaking `pom.xml`.

### Why are the changes needed?

**BEFORE**
```
$ dev/change-scala-version.sh 2.12
$ git diff | head -n10
diff --git a/core/pom.xml b/core/pom.xml
index dbde22f2bf..6ed368353b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
   </properties>

   <dependencies>
-    <!--<!--
```

**AFTER**
Since the default Scala version is `2.12`, the following no-op is the correct behavior, consistent with the previous behavior.
```
$ dev/change-scala-version.sh 2.12
$ git diff
```
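The underlying problem can be illustrated with a small, hypothetical Python `re` sketch (the real script uses `sed`, and the dependency snippet here is made up): a substitution that blindly wraps the block in comment markers matches again on an already-commented block, producing the `<!--<!--` shown above, whereas a pattern guarded against existing markers is a no-op on the second run.

```
import re

line = "    <!--<dependency>scala-parallel-collections</dependency>-->"

# Naive substitution: matches even inside an already-commented block,
# so re-running the script keeps prepending "<!--".
naive = re.sub(r"(<dependency>scala-parallel-collections</dependency>)",
               r"<!--\1-->", line)
print(naive)  # ... <!--<!--<dependency>...</dependency>-->-->

# Guarded substitution: skip blocks that are already wrapped in comment
# markers, so a second run is a no-op.
guarded = re.sub(r"(?<!<!--)(<dependency>scala-parallel-collections</dependency>)(?!-->)",
                 r"<!--\1-->", line)
print(guarded)  # unchanged: the block stays commented exactly once
```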

### Does this PR introduce _any_ user-facing change?

No. This is a dev only change.

### How was this patch tested?

Manually.

Closes #33996 from dongjoon-hyun/SPARK-36712.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-14 16:26:50 -07:00
yangjie01 119ddd7e95 [SPARK-36737][BUILD][CORE][SQL][SS] Upgrade Apache commons-io to 2.11.0 and revert change of SPARK-36456
### What changes were proposed in this pull request?
SPARK-36456 changed the code to use `JavaUtils.closeQuietly` instead of `IOUtils.closeQuietly`, but the two methods differ slightly in default behavior: both swallow the IOException, but the former logs it as ERROR while the latter doesn't log by default.

The `Apache commons-io` community decided to retain the `IOUtils.closeQuietly` method in the new version (75f20dca72, `src/main/java/org/apache/commons/io/IOUtils.java` L465-L467) and removed the deprecated annotation; the change has been released in version 2.11.0.

So the change of this PR is to upgrade `Apache commons-io` to 2.11.0 and revert the change of SPARK-36456 to maintain the original behavior (don't print an error log).

### Why are the changes needed?

1. Upgrade `Apache commons-io` to 2.11.0 to use the non-deprecated `closeQuietly` API; other changes related to `Apache commons-io` are detailed in [commons-io/changes-report](https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0)

2. Revert the change of SPARK-36456 to maintain the original `IOUtils.closeQuietly` API behavior (don't print an error log).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33977 from LuciferYang/upgrade-commons-io.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-14 21:16:58 +09:00
Lukas Rytz 1a62e6a2c1 [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR suggests working around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:50 -05:00
Kousuke Saruta e1e19619b7 [SPARK-36729][BUILD] Upgrade Netty from 4.1.63 to 4.1.68
### What changes were proposed in this pull request?

This PR upgrades Netty from `4.1.63` to `4.1.68`.

All the changes from `4.1.64` to `4.1.68` are as follows.

* 4.1.64 and 4.1.65
  * https://netty.io/news/2021/05/19/4-1-65-Final.html
* 4.1.66
  * https://netty.io/news/2021/07/16/4-1-66-Final.html
* 4.1.67
  * https://netty.io/news/2021/08/16/4-1-67-Final.html
* 4.1.68
  * https://netty.io/news/2021/09/09/4-1-68-Final.html

### Why are the changes needed?

Recently Netty `4.1.68` was released, which includes official M1 Mac support.
* Add support for mac m1
  * https://github.com/netty/netty/pull/11666

`4.1.65` also includes a critical bug fix which might affect Spark.
* JNI classloader deadlock with latest JDK version
  * https://github.com/netty/netty/issues/11209

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33970 from sarutak/upgrade-netty-4.1.68.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-12 10:07:27 -07:00
Dongjoon Hyun ff8cc4b800 [SPARK-36629][BUILD] Upgrade aircompressor to 1.21
### What changes were proposed in this pull request?

This PR aims to upgrade `aircompressor` dependency from 1.19 to 1.21.

### Why are the changes needed?

This will bring the fix for a bug which exists in `aircompressor` 1.17 ~ 1.20.
- 1e364f7133

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33883 from dongjoon-hyun/SPARK-36629.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-31 22:35:45 -07:00
Kousuke Saruta 4356d6603a [SPARK-36605][BUILD] Upgrade Jackson to 2.12.5
### What changes were proposed in this pull request?

This PR upgrades Jackson from `2.12.3` to `2.12.5`.

### Why are the changes needed?

Recently, Jackson `2.12.5` was released and it seems to be expected as the last full patch release for 2.12.x.
This release includes a fix for a regression in jackson-databind introduced in `2.12.3` which Spark 3.2 currently depends on.
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.5

### Does this PR introduce _any_ user-facing change?

Dependency maintenance.

### How was this patch tested?

CIs.

Closes #33860 from sarutak/upgrade-jackson-2.12.5.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-28 15:57:24 -07:00
yangjie01 1ccb06ca8c Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
### What changes were proposed in this pull request?
This PR reverts the change of SPARK-34309, which includes:

- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?

1. No real performance improvement in Spark
2. Added an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:36:15 +09:00
Gengliang Wang 42eebb84f5 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.

### Why are the changes needed?

Fix release script and update README of docs

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Manual test locally.

Closes #33797 from gengliangwang/fixReleaseDocker.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 20:02:24 +08:00
Kousuke Saruta 281b00ab5b [SPARK-34309][BUILD][FOLLOWUP] Upgrade Caffeine to 2.9.2
### What changes were proposed in this pull request?

This PR upgrades Caffeine to `2.9.2`.
Caffeine was introduced in SPARK-34309 (#31517). At the time that PR was opened, the latest version of caffeine was `2.9.1` but now `2.9.2` is available.

### Why are the changes needed?

`2.9.2` has the following improvements (https://github.com/ben-manes/caffeine/releases/tag/v2.9.2).

* Fixed reading an intermittent null weak/soft value during a concurrent write
* Fixed extraneous eviction when concurrently removing a collected entry after a writer resurrects it with a new mapping
* Fixed excessive retries of discarding an expired entry when the fixed duration period is extended, thereby resurrecting it

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33772 from sarutak/upgrade-caffeine-2.9.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-18 13:40:52 +09:00
Sean Owen 3b0dd14f1c Update Spark key negotiation protocol 2021-08-11 18:04:55 -05:00
William Hyun aff1b5594a [SPARK-36482][BUILD] Bump orc to 1.6.10
### What changes were proposed in this pull request?
This PR aims to bump ORC to 1.6.10

### Why are the changes needed?
This will bring the latest bug fixes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33712 from williamhyun/orc.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-11 11:32:06 -07:00
Liang-Chi Hsieh 7d13ac177b [SPARK-36393][BUILD] Try to raise memory for GHA
### What changes were proposed in this pull request?

According to the feedback from GitHub, the change causing the memory issue has been rolled back. We can try to raise memory again for GA.

### Why are the changes needed?

Trying higher memory settings for GA. It could speed up the testing time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #33623 from viirya/increasing-mem-ga.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-05 01:31:35 -07:00
yangjie01 01cf6f4c6b [SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache
### What changes were proposed in this pull request?
There are 3 ways to use Guava cache in spark code:

1. `LoadingCache` is the main way to use Guava cache in Spark code and the key usages are as follows:
  a. `LoadingCache` with a `maximumSize` data eviction policy, such as `appCache` in `ApplicationCache` and `cache` in `CodeGenerator`
  b. `LoadingCache` with a `maximumWeight` data eviction policy, such as `shuffleIndexCache` in `ExternalShuffleBlockResolver`
  c. `LoadingCache` with an `expireAfterWrite` data eviction policy, such as `tableRelationCache` in `SessionCatalog`
2. `ManualCache` is another way to use Guava cache in Spark code and the key usage is `cache` in `SharedInMemoryCache`, which is used to cache partition file statuses in memory

3. The last usage is `hadoopJobMetadata` in `SparkEnv`, which uses Guava Cache to build a soft-reference map.

The goal of this PR is to use `Caffeine` instead of `Guava Cache` because `Caffeine` is faster than `Guava Cache` in benchmarks. The main changes are as follows:

1. Add `Caffeine` deps to maven `pom.xml`

2. Use `Caffeine` instead of Guava `LoadingCache`, `ManualCache` and soft-reference map in `SparkEnv`

3. Add `LocalCacheBenchmark` to compare the performance of `LoadingCache` between `Guava Cache` and `Caffeine`

### Why are the changes needed?
`Caffeine` is faster than `Guava Cache` in benchmarks

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Add `LocalCacheBenchmark` to compare the performance of `LoadingCache` between `Guava Cache` and `Caffeine`

Closes #31517 from LuciferYang/guava-cache-to-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Holden Karau <hkarau@netflix.com>
2021-08-04 12:01:44 -07:00
Hyukjin Kwon c0d1860f25 [SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins
### What changes were proposed in this pull request?

This PR proposes to migrate the coverage report from Jenkins to GitHub Actions by setting up a daily cron job.

### Why are the changes needed?

For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/

Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around by manually hosting a coverage site via GitHub pages, see also https://spark-test.github.io/pyspark-coverage-site/ by spark-test account (which is shared to only subset of PMC members).

Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used.

### Does this PR introduce _any_ user-facing change?

Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same.

### How was this patch tested?

I manually tested:
- Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484
- Coverage report: 73f0291a7d/python/pyspark
- Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175

Closes #33591 from HyukjinKwon/SPARK-36092.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:37:19 +09:00
attilapiros 801017fdec [SPARK-36358][K8S] Upgrade Kubernetes Client Version to 5.6.0
### What changes were proposed in this pull request?

Upgrade Kubernetes Client Version to 5.6.0.

### Why are the changes needed?

The exponential backoff feature is extended with one more case:
[ Retry HTTP operation in case IOException too (exponential backoff)](https://github.com/fabric8io/kubernetes-client/pull/3293)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested with existing unit and integration tests:

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
...
[INFO] BUILD SUCCESS
```

Closes #33593 from attilapiros/SPARK-36358.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 08:25:33 -07:00
itholic abce61f3fd [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually tested locally, with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:48 -07:00
Yuanjian Li 4cd5fa96d8 [SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3
### What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.

### Why are the changes needed?
For further ARM support and to leverage the bug fixes in the newer version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33578 from xuanyuanking/SPARK-36347.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-07-29 11:08:58 -07:00
Enrico Minack f90eb6a5db [SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark
### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.
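A minimal usage sketch of the added API (illustrative; the metric names and values here are made up, and the real tests live in the PR's `test_observe`):

```
from pyspark.sql import Observation, SparkSession
from pyspark.sql.functions import count, lit, max

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Attach named metrics to the DataFrame; they are collected as a side effect
# of the next action, without an extra pass over the data.
observation = Observation("metrics")
observed = df.observe(observation, count(lit(1)).alias("rows"), max("id").alias("max_id"))

observed.collect()      # running an action materializes the metrics
print(observation.get)  # e.g. {'rows': 10, 'max_id': 9}
```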

### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.

### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.tests.test_dataframe`.

Closes #33484 from EnricoMi/branch-observation-python.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 01:39:34 +08:00
William Hyun 674202e7b6 [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job
### What changes were proposed in this pull request?
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.

### Why are the changes needed?
This will save GHA resource because MiMa is irrelevant to Python.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GHA.

Closes #33532 from williamhyun/mima.

Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 16:47:59 +09:00
Takuya UESHIN 663cbdfbe5 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first so the Python executable can be switched, and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:11 +09:00
Liang-Chi Hsieh fd36ed4550 [SPARK-36270][BUILD] Change memory settings for enabling GA
### What changes were proposed in this pull request?

Trying to adjust build memory settings and serial execution to re-enable GA.

### Why are the changes needed?

GA tests have been failing recently due to return code 137. We need to adjust build settings to make GA work.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

GA

Closes #33447 from viirya/test-ga.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:10:45 +09:00
Hyukjin Kwon d6bc8cd681 [SPARK-36268][PYTHON] Set the lowerbound of mypy version to 0.910
### What changes were proposed in this pull request?

This PR proposes to set the lower bound of the mypy version used in the testing script.

### Why are the changes needed?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141519/console

```
python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Found 13 errors in 4 files (checked 314 source files)
1
```

Jenkins installed mypy in SPARK-32797, but it seems the installed version is not the same as in GitHub Actions.

It seems difficult to make the codebase compatible with multiple mypy versions. Therefore, this PR sets the lower bound.
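The actual check lives in the shell-based `dev/lint-python`, but the logic it adds boils down to a simple lower-bound comparison; a hypothetical Python sketch (names and messages modeled on the output shown below, not the real script):

```
import re
import subprocess

MINIMUM_MYPY = (0, 910)

def installed_mypy_version():
    # Returns the installed mypy version as a tuple, or None if mypy is missing.
    try:
        out = subprocess.check_output(["mypy", "--version"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"(\d+)\.(\d+)", out)
    return tuple(int(g) for g in match.groups()) if match else None

version = installed_mypy_version()
if version is None:
    print("The mypy command was not found. Skipping for now.")
elif version < MINIMUM_MYPY:
    print("The minimum mypy version needs to be 0.910. "
          "Your current version is mypy %d.%d. Skipping for now." % version)
else:
    print("starting mypy test...")
```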

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins job in this PR should test it out.

Also manually tested:

Without mypy:

```
...
flake8 checks passed.

The mypy command was not found. Skipping for now.
```

With mypy 0.812:

```
...
flake8 checks passed.

The minimum mypy version needs to be 0.910. Your current version is mypy 0.812. Skipping for now.
```

With mypy 0.910:

```
...
flake8 checks passed.

starting mypy test...
mypy checks passed.

all lint-python tests passed!
```

Closes #33487 from HyukjinKwon/SPARK-36268.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:28:16 +09:00
Dongjoon Hyun a1a197403b [SPARK-36262][BUILD] Upgrade ZSTD-JNI to 1.5.0-4
### What changes were proposed in this pull request?

This PR aims to upgrade ZSTD-JNI to 1.5.0-4.

### Why are the changes needed?

ZSTD-JNI 1.5.0-3 has a packaging issue. 1.5.0-4 is recommended to be used instead.
- https://github.com/luben/zstd-jni/issues/181#issuecomment-885138495

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33483 from dongjoon-hyun/SPARK-36262.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-22 14:03:59 -07:00
Sean Owen 518f00fd78 [SPARK-35310][MLLIB] Update to breeze 1.2
### What changes were proposed in this pull request?

Update to the latest breeze 1.2

### Why are the changes needed?

Minor bug fixes

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #33449 from srowen/SPARK-35310.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-22 13:58:01 -05:00
Kousuke Saruta 13aefd6a66 [SPARK-36256][BUILD] Upgrade lz4-java to 1.8.0
### What changes were proposed in this pull request?

This PR upgrades `lz4-java` to `1.8.0`, which includes not only performance improvements but also Darwin aarch64 support.
https://github.com/lz4/lz4-java/releases/tag/1.8.0
https://github.com/lz4/lz4-java/blob/1.8.0/CHANGES.md

### Why are the changes needed?

For providing better performance and platform support.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33476 from sarutak/upgrade-lz4-java-1.8.0.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-22 20:39:59 +08:00
Kousuke Saruta dcb7db5370 [SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
### What changes were proposed in this pull request?

This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`.
`1.5.0-3` was released a few days ago.
This release resolves an issue about buffer size calculation, which can affect usage in Spark.
https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3

### Why are the changes needed?

It might be a corner case where the skipping length is greater than `2^31 - 1`, but it could possibly affect Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-21 19:37:05 -07:00
shane knapp ad528a007a [SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages
### What changes were proposed in this pull request?
updating the anaconda py36 environment file

### Why are the changes needed?
see:
https://issues.apache.org/jira/browse/SPARK-32666
https://issues.apache.org/jira/browse/SPARK-33242
https://issues.apache.org/jira/browse/SPARK-32391
https://issues.apache.org/jira/browse/SPARK-32797

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
jenkins will test this

Closes #33469 from shaneknapp/updating-python-paks.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2021-07-21 15:22:06 -07:00
Hyukjin Kwon 801b369bd0 [SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build
### What changes were proposed in this pull request?

A Scala 2.13 daily job was added but ideally we should deduplicate it. This PR aims to deduplicate it by creating one more job (`configure-jobs`) that the main job depends on.

`configure-jobs` sets the branch, envs, etc. properly so the main build runs correctly.

### Why are the changes needed?

To make the maintenance easier

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

See
- https://github.com/HyukjinKwon/spark/actions/runs/1044636792 for a PR
- https://github.com/HyukjinKwon/spark/actions/runs/1048542984 for a cron job

Closes #33410 from HyukjinKwon/SPARK-36204.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 22:21:27 +09:00
Kousuke Saruta c7ccc602db [SPARK-36166][TESTS][FOLLOWUP] Add BLOCK_SCALA_VERSION to sparktestssupport/__init__.py
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestssupport/__init__.py`.

### Why are the changes needed?

The following command fails because the definition is missing.
```
SCALA_PROFILE=scala2.12 dev/run-tests.py
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The command shown above works.

Closes #33421 from sarutak/followup-SPARK-36166.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 22:47:03 +09:00
Hyukjin Kwon 8ee199ef42 [SPARK-36166][TESTS][FOLLOW-UP] Add Scala version change logic into testing script
### What changes were proposed in this pull request?

This PR is a simple followup from https://github.com/apache/spark/pull/33376:
- It simplifies a bit by removing the default Scala version from the testing script (so we don't have to change it here in the future when the default Scala version changes).
- Call `change-scala-version.sh` script (when `SCALA_PROFILE` is explicitly specified)

### Why are the changes needed?

More refactoring. In addition, this change will be used at https://github.com/apache/spark/pull/33410

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #33411 from HyukjinKwon/SPARK-36166.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 18:01:02 +09:00
William Hyun c336f73ccd [SPARK-36198][TESTS] Skip UNIDOC generation in PySpark GHA job
### What changes were proposed in this pull request?
This PR aims to skip UNIDOC generation in PySpark GHA job.

### Why are the changes needed?

PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total.
- https://github.com/apache/spark/runs/3098268973?check_suite_focus=true
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc
...
[info] Main Java API documentation successful.
[success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the GHA.

Closes #33407 from williamhyun/SKIP_UNIDOC.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 17:52:28 -07:00
Dongjoon Hyun f66153de78 [SPARK-36166][TESTS] Support Scala 2.13 test in dev/run-tests.py
### What changes were proposed in this pull request?

For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`.

In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we don't maintain the dependency manifests for it yet. This will be handled after the Apache Spark 3.2.0 release.
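Conceptually, the new profile handling maps the `SCALA_PROFILE` environment variable onto an extra build profile flag and skips the dependency-manifest check for 2.13. A hypothetical sketch of that logic (illustrative only, not the actual `dev/run-tests.py` code):

```
import os

def extra_scala_profile_args():
    # Map SCALA_PROFILE (e.g. "scala2.13") onto sbt/Maven profile flags.
    profile = os.environ.get("SCALA_PROFILE", "")
    if profile in ("", "scala2.12"):
        return []  # 2.12 is the default build, no extra profile needed
    if profile == "scala2.13":
        return ["-Pscala-2.13"]
    raise Exception("Unknown SCALA_PROFILE: %s" % profile)

def should_run_dependency_check():
    # Dependency manifests are only maintained for the default Scala 2.12 build,
    # so test-dependencies.sh is skipped when building with Scala 2.13.
    return os.environ.get("SCALA_PROFILE", "") != "scala2.13"
```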

### Why are the changes needed?

To test Scala 2.13 with `dev/run-tests.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following is the result. Note that this PR aims to **run** Scala 2.13 tests instead of **passing** them. We will have daily GitHub Action job via #33358 and will fix UT failures if exists.
```
$ dev/change-scala-version.sh 2.13

$ SCALA_PROFILE=scala2.13 dev/run-tests.py
...
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl
...
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly
...

[info] Building Spark assembly using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package
...

========================================================================
Running Java style checks
========================================================================
[info] Checking Java style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud
...

========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc
...

========================================================================
Running Spark unit tests
========================================================================
[info] Running Spark tests using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test
...
```

Closes #33376 from dongjoon-hyun/SPARK-36166.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 19:26:07 -07:00
Dongjoon Hyun 5f41a2752f [SPARK-36164][INFRA][FOLLOWUP] Add empty string check back
### What changes were proposed in this pull request?

This is a follow-up of #33371.
In the branch-commit GitHub Actions run, we have an empty environment variable.
This PR adds back the empty string check logic.
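A minimal sketch of the restored guard, using a simplified stand-in for `identify_changed_files_from_git_commits` in `dev/run-tests.py` (illustrative only):

```
import os
import subprocess

def identify_changed_files_from_git_commits(patch_sha, target_ref):
    # Guard against an empty target ref: `git diff HEAD ''` fails with
    # "ambiguous argument ''", which is exactly the failure shown below.
    if not target_ref:
        return []
    raw_output = subprocess.check_output(
        ["git", "diff", "--name-only", patch_sha, target_ref],
        universal_newlines=True)
    return [f for f in raw_output.splitlines() if f]

# On branch commits, APACHE_SPARK_REF can be set but empty.
changed_files = identify_changed_files_from_git_commits(
    "HEAD", target_ref=os.environ.get("APACHE_SPARK_REF", ""))
```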

### Why are the changes needed?

Currently, the failure happens when we use `--modules` in GitHub Action.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
fatal: ambiguous argument '': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main
    changed_files = identify_changed_files_from_git_commits(
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits
    raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target],
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually. The following failure is correct in local environment because it passed `identify_changed_files_from_git_commits` already.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main
    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'GITHUB_SHA'
```

Closes #33374 from dongjoon-hyun/SPARK-36164.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 13:44:17 -07:00
William Hyun c8a3c22628 [SPARK-36164][INFRA] run-tests.py should not fail when APACHE_SPARK_REF is not defined
### What changes were proposed in this pull request?
This PR aims to change `run-tests.py` so that it does not fail when `os.environ["APACHE_SPARK_REF"]` is not defined.

### Why are the changes needed?
Currently, `run-tests.py` ends with an error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33371 from williamhyun/SPARK-36164.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 11:43:30 -07:00
Hyukjin Kwon 6bd385f1e3 [SPARK-36159][BUILD] Replace 'python' to 'python3' in dev/test-dependencies.sh
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/26330. `dev/test-dependencies.sh` is the last place left to fix.

### Why are the changes needed?

To stick to Python 3 instead of using Python 2 mistakenly.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #33368 from HyukjinKwon/change-python-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 07:58:18 -07:00
Dongjoon Hyun 5acfecbf97 [SPARK-36150][INFRA][TESTS] Disable MiMa for Scala 2.13 artifacts
### What changes were proposed in this pull request?

This PR aims to disable MiMa check for Scala 2.13 artifacts.

### Why are the changes needed?

Apache Spark doesn't have Scala 2.13 Maven artifacts yet.
SPARK-36151 will enable this after Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following should succeed without real testing.
```
$ dev/mima -Pscala-2.13
```

Closes #33355 from dongjoon-hyun/SPARK-36150.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 00:34:22 -07:00
Dongjoon Hyun 416a7fd490 [SPARK-36139][INFRA][TESTS] Remove Python 3.6 from pyspark GitHub Action job
### What changes were proposed in this pull request?

This PR aims to remove Python 3.6 installation from `pyspark` job in `build and test` GitHub Action Workflow for Apache Spark 3.3.

### Why are the changes needed?

Python 3.6 is deprecated via SPARK-35938. This will save GitHub Action resources by removing Python 3.6 testing.

**BEFORE**
```
Will test against the following Python executables: ['python3.6', 'python3.9', 'pypy3']
```

**AFTER**
```
 Will test against the following Python executables: ['python3.9', 'pypy3']
```

Note that Python 3.6 is still used in the following cases.
- In another jobs like `Linter`
- In the `dev/run-pip-tests` script, pip packaging testing via `conda`.
  - This is handled via https://github.com/apache/spark/pull/33351

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33349 from dongjoon-hyun/SPARK-36139.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-14 21:01:25 -07:00
Dongjoon Hyun 5202944b50 [SPARK-36144][INFRA][TESTS] Use Python 3.9 in run-pip-tests conda environment
### What changes were proposed in this pull request?

This PR aims to use Python 3.9 instead of 3.6 during `run-pip-tests` conda environment for Apache Spark 3.3.

### Why are the changes needed?

Python 3.6 is deprecated via SPARK-35938 at Apache Spark 3.2. We had better have Python 3.9 test coverage in Apache Spark 3.3.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Pass the CIs.

Closes #33351 from dongjoon-hyun/SPARK-36144.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-15 11:45:00 +09:00
Kousuke Saruta fd06cc211d [SPARK-36129][BUILD] Upgrade commons-compress to 1.21 to deal with CVEs
### What changes were proposed in this pull request?

This PR upgrades `commons-compress` from `1.20` to `1.21` to deal with CVEs.

### Why are the changes needed?

Some CVEs which affect `commons-compress 1.20` are reported and fixed in `1.21`.
https://commons.apache.org/proper/commons-compress/security-reports.html

* CVE-2021-35515
* CVE-2021-35516
* CVE-2021-35517
* CVE-2021-36090

The severities are reported as low for all the CVEs but it would be better to deal with them just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33333 from sarutak/upgrade-commons-compress-1.21.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-13 22:53:14 -07:00
Wenchen Fan ae6199af44 Revert "[SPARK-35253][SPARK-35398][SQL][BUILD] Bump up the janino version to v3.1.4"
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/32455 and its followup https://github.com/apache/spark/pull/32536 , because the new janino version has a bug that is not fixed yet: https://github.com/janino-compiler/janino/pull/148

### Why are the changes needed?

avoid regressions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33302 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 12:14:08 +09:00
Yikun Jiang fdc50f4452 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which is already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unittests passed

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:13 +09:00
attilapiros bad6f89ae2 [SPARK-36026][BUILD][K8S] Upgrade kubernetes-client to 5.5.0
### What changes were proposed in this pull request?

Upgrading the kubernetes-client to 5.5.0

### Why are the changes needed?

There are [several bugfixes](https://github.com/fabric8io/kubernetes-client/releases/tag/v5.5.0) but the main reason is version 5.5.0 contains [Support HTTP operation retry with exponential backoff (for status code >= 500)](https://github.com/fabric8io/kubernetes-client/issues/3087).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the integration tests including `persistentVolume` tests:

```
./resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \
    --spark-tgz $TARBALL_TO_TEST --hadoop-profile $HADOOP_PROFILE --exclude-tags r --include-tags persistentVolume
...
[INFO] --- scalatest-maven-plugin:2.0.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 413 milliseconds.
Run starting. Expected test count is: 26
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
Run completed in 18 minutes, 34 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Checked the compatibility matrix and the same k8s versions are supported as were by version 5.4.1.

Closes #33233 from attilapiros/SPARK-36026.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:02:37 +09:00
Kousuke Saruta 7f70350929 [SPARK-36013][BUILD] Upgrade Dropwizard Metrics to 4.2.2
### What changes were proposed in this pull request?

This PR aims to upgrade Dropwizard Metrics from `4.2.0` to `4.2.2`.

### Why are the changes needed?

Dropwizard `4.2.1` fixes a bug related to `JMXReporter`, but `4.2.1` also contains a bug, so upgrading to `4.2.2` seems better.
https://github.com/dropwizard/metrics/releases/tag/v4.2.1
https://github.com/dropwizard/metrics/releases/tag/v4.2.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33209 from sarutak/upgrade-metrics-4.2.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-05 17:49:50 +09:00