### What changes were proposed in this pull request?
Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: An Ops class per ExtensionDtype and its data-type-based from/to pandas conversion will be implemented in a separate PR, tracked as https://issues.apache.org/jira/browse/SPARK-35614.
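To illustrate the direction, here is a minimal sketch with hypothetical names (the actual Ops classes and conversion hooks in `pyspark.pandas` differ): each per-dtype Ops class owns its piece of the pandas conversion instead of one shared if/elif chain.
```py
import datetime

import pandas as pd


class DataTypeOps:
    def restore(self, col: pd.Series) -> pd.Series:
        # Default: values coming back from Spark need no post-processing.
        return col


class DateOps(DataTypeOps):
    def restore(self, col: pd.Series) -> pd.Series:
        # DateType values should surface as datetime.date, not timestamps.
        return col.apply(lambda v: v.date() if isinstance(v, datetime.datetime) else v)


print(DateOps().restore(pd.Series([datetime.datetime(2021, 6, 1, 12, 0)])))
# 0    2021-06-01
# dtype: object
```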
### Why are the changes needed?
The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes the code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR aims to upgrade kubernetes-client to 5.4.1.
### Why are the changes needed?
This will bring a few bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.1
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32798 from dongjoon-hyun/SPARK-35660.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes applying `black` to the pandas API on Spark code base, to improve static analysis.
By executing `./dev/reformat-python` in the Spark home directory, all the pandas API on Spark code is reformatted according to the static analysis rules.
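For illustration, a made-up snippet (not actual Spark code) showing the kind of mechanical rewrite black applies:
```py
# Before reformatting:
def head(items,limit=  3):
    return [ x for x in items ][:limit]


# After `./dev/reformat-python` (black):
def head(items, limit=3):
    return [x for x in items][:limit]
```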
### Why are the changes needed?
This reduces the cost of static analysis during development.
It has been used continuously for about a year in the Koalas project and its convenience has been proven.
### Does this PR introduce _any_ user-facing change?
No, it's dev-only.
### How was this patch tested?
Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.
Closes #32779 from itholic/SPARK-35499.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to update the `dev/requirements.txt` file.
NOTE that:
- This file isn't used anywhere in Apache Spark CI. It's just for convenience.
- To minimize maintenance overhead, I removed all lower bounds on dependencies, which means that the latest versions of them should work in a clean environment (e.g., you can reinstall all of them).
### Why are the changes needed?
To document the dependencies needed for Spark development, and to make environment setup easier.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Logically derived from setup.py and other places such as the CI configuration.
Closes #32780 from HyukjinKwon/SPARK-35648.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Splits some tests in the `pyspark-pandas` module out as slow tests to rebalance the test duration.
Picked the top 12 tests from the previous runs and the total times are almost even.
### Why are the changes needed?
Currently `pyspark-pandas` module tests take long time, so we should rebalance the tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32778 from ueshin/issues/SPARK-35642/split-pandas-on-spark-tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is, except for these differences (illustrated in the example below):
- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation
Other than that:
- Excluded `python/docs/build/html` from the linter
- Fixed GA dependency installation
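For instance, the renamed entry points look like this after the port (a small sketch assuming Spark 3.2 with the pandas API on Spark available):
```py
from pyspark.sql import SparkSession
import pyspark.pandas as ps  # previously: import databricks.koalas as ks

spark = SparkSession.builder.getOrCreate()
psdf = spark.range(3).to_pandas_on_spark()  # previously: .to_koalas()
print(isinstance(psdf, ps.DataFrame))  # True
```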
### Why are the changes needed?
To document pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new documentation.
### How was this patch tested?
Manually built the docs and checked the output.
Closes #32726 from HyukjinKwon/SPARK-35587.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds rules to `checkstyle.xml` and `scalastyle-config.xml` to avoid reintroducing `Objects.toStringHelper`, a Guava API that is no longer present in newer Guava releases.
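Conceptually, each new rule is a single-line regexp check. Here is a Python sketch of what the rules enforce (not the actual Checkstyle/Scalastyle configuration, which lives in the XML files above):
```py
import re
import sys

PATTERN = re.compile(r"Objects\.toStringHelper")


def lint(path: str) -> int:
    """Return the number of offending lines in the given source file."""
    failures = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if PATTERN.search(line):
                print(f"{path}:{lineno}: Avoid using Objects.toStringHelper."
                      " Use ToStringBuilder instead.")
                failures += 1
    return failures


if __name__ == "__main__":
    sys.exit(1 if sum(lint(p) for p in sys.argv[1:]) else 0)
```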
### Why are the changes needed?
SPARK-30272 (#26911) replaced `Objects.toStringHelper`, an API that Guava 14 provides, with the `commons.lang3` API because `Objects.toStringHelper` is no longer present in newer Guava.
But `toStringHelper` was introduced into Spark again and had to be replaced once more in SPARK-35420 (#32567).
I think it's better to have a style rule to avoid such repetition.
SPARK-30272 replaced some APIs besides `Objects.toStringHelper`, but only `Objects.toStringHelper` seems to affect Spark for now, so I added rules only for it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that `lint-java` and `lint-scala` detect the usage of `toStringHelper` and make the lint checks fail.
```
$ dev/lint-java
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.14/scala-2.12.14.tgz
Using `mvn` from path: /opt/maven/3.6.3//bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/protocol/OneWayMessage.java:[78] (regexp) RegexpSinglelineJava: Avoid using Object.toStringHelper. Use ToStringBuilder instead.
$ dev/lint-scala
Scalastyle checks failed at following occurrences:
[error] /home/kou/work/oss/spark/core/src/main/scala/org/apache/spark/rdd/RDD.scala:93:25: Avoid using Object.toStringHelper. Use ToStringBuilder instead.
[error] Total time: 25 s, completed 2021/06/02 16:18:25
```
Closes #32740 from sarutak/style-rule-for-guava.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to remove PySpark documentation build in linter check because:
- to speed up CI build by removing duplicate documentation build (linter and doc build)
- to prepare for https://github.com/apache/spark/pull/32726. With that PR, the PySpark documentation build requires a full Spark build to generate plot images in the PySpark documentation, so it makes less sense to require it in the Python linter
- to remove unnecessary dependency installation for Python linter in CI
### Why are the changes needed?
The Python linter script includes a documentation build. Because of this, we run duplicate documentation builds in CI, install unnecessary dependencies, and spend extra time. It makes more sense to exclude this from the Python linter.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested, and it will be tested in CI.
Closes #32760 from HyukjinKwon/SPARK-35620.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
#32631 was merged once, but it lacked some consideration.
The diff between this change and 692d95d145, merged in #32631, is as follows.
```
if: github.repository != 'apache/spark'
id: sync-branch
run: |
+ apache_spark_ref=`git rev-parse HEAD`
git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' commit -m "Merged commit"
+ echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
- name: Cache Scala, SBT and Maven
uses: actions/cache@v2
with:
```
### Why are the changes needed?
CI for `docker-integration-tests` is absent for now.
### Does this PR introduce _any_ user-facing change?
GA.
### How was this patch tested?
Closes #32691 from sarutak/docker-integration-test-ga-take2.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
`docker-integration-tests` can't run if Docker is not installed, so it runs only if `docker-integration-tests` is explicitly specified with `--modules` (a sketch of the gating idea follows).
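A minimal sketch of that gating with a hypothetical helper (the real logic lives in `dev/run-tests.py`):
```py
def should_run_docker_it(requested_modules):
    # Skipped by default because it needs a local Docker daemon; run only
    # when the module is explicitly requested.
    return "docker-integration-tests" in requested_modules


print(should_run_docker_it(["pyspark-sql"]))               # False
print(should_run_docker_it(["docker-integration-tests"]))  # True
```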
### Why are the changes needed?
CI for `docker-integration-tests` is absent for now.
### Does this PR introduce _any_ user-facing change?
GA.
### How was this patch tested?
Closes #32631 from sarutak/docker-integration-test-ga.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.
### Why are the changes needed?
The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its own special addition semantics, namely concatenation.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0 [49, 49]
1 [50, 50]
2 [51, 51]
dtype: object
>>> psser + b'1'
0 [49, 49]
1 [50, 49]
2 [51, 49]
dtype: object
```
### How was this patch tested?
Unit tests.
Closes #32665 from xinrong-databricks/datatypeops_binary.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to introduce ArrayOps, MapOps, and StructOps to handle data-type-based operations for ArrayType, MapType, and StructType separately.
### Why are the changes needed?
StructType, ArrayType, and MapType are not accepted by DataTypeOps now.
We should handle these complex types. Among them:
- ArrayType supports concatenation: for example, `ps.Series([[1, 2, 3]]) + ps.Series([[4, 5, 6]])` should work the same as `pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])`, i.e., as concatenation.
- StructOps will be helpful to make to/from pandas conversion data-type-based.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0 [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0 [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```
### How was this patch tested?
Unit tests.
Closes #32626 from xinrong-databricks/datatypeop_complex.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR aims to upgrade json4s from 3.7.0-M5 to 3.7.0-M11.
Note: json4s versions greater than 3.7.0-M11 are not binary compatible with Spark's third-party jars.
### Why are the changes needed?
Multiple defect fixes and improvements like:
- https://github.com/json4s/json4s/issues/750
- https://github.com/json4s/json4s/issues/554
- https://github.com/json4s/json4s/issues/715
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran with the existing UTs
Closes #32636 from vinodkc/br_build_upgrade_json4s.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR enables GitHub Actions to test PySpark with Python 3.9.
### Why are the changes needed?
To verify the support of Python 3.9.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Existing tests should cover this.
Closes #32657 from HyukjinKwon/SPARK-35506.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade joda-time from 2.10.5 to 2.10.10
### Why are the changes needed?
Improvements and bug fixes in joda-time:
https://www.joda.org/joda-time/changes-report.html#a2.10.10
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran with the existing UTs
Closes #32661 from vinodkc/br_build_upgrade_joda_time.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Automatically update the DocSearch version index via release-tag.sh when releasing a new documentation site, instead of the current manual update.
### Why are the changes needed?
Simplify the release process.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually run the following command and check the diff
```
R_NEXT_VERSION=3.2.0
sed -i".tmp8" "s/'facetFilters':.*$/'facetFilters': [\"version:$R_NEXT_VERSION\"]/g" docs/_config.yml
```
Closes #32662 from gengliangwang/updateDocsearchInRelease.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache HttpCore from 4.4.12 to 4.4.14.
### Why are the changes needed?
Stability improvements in httpcore 4.4.14
- Bug fix: Non-blocking TLSv1.3 connections can end up in an infinite event spin when closed concurrently by the local and the remote endpoints.
- HTTPCORE-647: Non-blocking connection terminated due to 'java.io.IOException: Broken pipe' can enter an infinite loop flushing buffered output data.
- PR #201, HTTPCORE-634: Fix race condition in AbstractConnPool that can cause internal state corruption.
- HTTPCORE-612: DefaultConnectionReuseStrategy incorrectly used int to represent Content-Length value instead of long.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
With Jenkins Tests
Closes #32638 from vinodkc/br_build_upgrade_httpcore.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ORC to 1.6.8.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://orc.apache.org/news/2021/05/21/ORC-1.6.8/
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the existing CIs.
Closes #32635 from dongjoon-hyun/SPARK-35489.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 7.3.1.
- https://issues.apache.org/jira/browse/XBEAN-323
- https://asm.ow2.io/versions.html
### Why are the changes needed?
ASM 7.3.1 brings the following changes:
- new V15 constant
- experimental support for PermittedSubtypes and RecordComponent
- bug fixes
  - 317885: SKIP_DEBUG now skips MethodParameters attributes
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran with the existing UTs
Closes #32634 from vinodkc/br_build_upgrade_asm.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR upgrades Dropwizard metrics to 4.2.0.
I also modified the corresponding links in `docs/monitoring.md`.
### Why are the changes needed?
The latest version was released last week and it contains some improvements.
https://github.com/dropwizard/metrics/releases/tag/v4.2.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Build succeeds and all the modified links are reachable.
Closes #32628 from sarutak/upgrade-dropwizard-4.2.0.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `kubernetes-client` from 5.3.1 to 5.4.0 to support K8s 1.21 models officially.
### Why are the changes needed?
`kubernetes-client` 5.4.0 has `Kubernetes Model v1.21.0`
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.0
### Does this PR introduce _any_ user-facing change?
No. This is a dev-only change.
### How was this patch tested?
Pass the CIs including Jenkins K8s IT.
- https://github.com/apache/spark/pull/32612#issuecomment-845456039
I tested K8s IT with the following versions.
- minikube version: v1.20.0
- K8s Client Version: v1.21.0
- Server Version: v1.21.0
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 18 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes #32612 from dongjoon-hyun/SPARK-35462.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes the antlr4-runtime version from 4.8-1 to 4.8.
### Why are the changes needed?
Version 4.8 is the official release version, with a proper release note (see https://github.com/antlr/antlr4/releases) and artifacts listed in https://www.antlr.org/download/index.html.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Will rely on tests in the PR.
Closes #32603 from bozhang2820/antlr-4.8.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
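A minimal, self-contained sketch of the idea (hypothetical names; the real `pyspark.pandas` hierarchy is richer): one Ops class per data type, chosen from the operand's Spark data type, instead of one if/elif chain inside each operator.
```py
from pyspark.sql.types import DataType, NumericType, StringType


class DataTypeOps:
    def add(self, left, right):
        raise TypeError("addition can not be applied to given types")


class NumericOps(DataTypeOps):
    def add(self, left, right):
        return left + right  # plain numeric addition


class StringOps(DataTypeOps):
    def add(self, left, right):
        if not isinstance(right, str):
            raise TypeError("string addition requires a string operand")
        return left + right  # string concatenation


def ops_for(dtype: DataType) -> DataTypeOps:
    # Dispatch on the Spark data type instead of branching in each operator.
    if isinstance(dtype, NumericType):
        return NumericOps()
    if isinstance(dtype, StringType):
        return StringOps()
    return DataTypeOps()


print(ops_for(StringType()).add("a", "b"))  # ab
```
Each arithmetic dunder can then delegate to the Ops object for its operand's data type, so supporting a new data type means adding one subclass instead of editing every operator.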
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change behavior based on the data type.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes #32596 from xinrong-databricks/datatypeop_arith_fix.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change behavior based on the data type.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes #32469 from xinrong-databricks/datatypeop_arith.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The following two things are done in this PR.
* Add a note about Jinja2 as a required dependency for the documentation build.
* Add the Jinja2 dependency for the documentation build to `spark-rm/Dockerfile`.
### Why are the changes needed?
SPARK-35375 (#32509) pinned the Jinja2 version to <3.0.0.
So it's good to note that in `docs/README.md` and add the dependency to `spark-rm/Dockerfile`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that `make html` succeeds under `python/docs` after installing the dependencies with the following command.
```
sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0'
```
Closes #32573 from sarutak/required-module-for-python-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Currently, the pip packaging test is being skipped:
```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Missing virtualenv & conda, skipping pip installability tests
Cleaning up temporary directory - /tmp/tmp.iILYWISPXW
```
See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true
The GitHub Actions image has its default Conda installed at `/usr/share/miniconda`, but it seems the image we're using for PySpark does not have it (which is legitimate).
This PR proposes to install Conda to use in pip packaging tests in GitHub Actions.
### Why are the changes needed?
To recover the test coverage.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
It was tested in my fork: https://github.com/HyukjinKwon/spark/runs/2575126882?check_suite_focus=true
```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Using conda virtual environments
Testing pip installation with python 3.6
Using /tmp/tmp.qPjTenqfGn for virtualenv
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /tmp/tmp.qPjTenqfGn/3.6
added / updated specs:
- numpy
- pandas
- pip
- python=3.6
- setuptools
...
Successfully ran pip sanity check
```
Closes #32537 from HyukjinKwon/SPARK-35393.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
The major changes of this upgrade are as follows:
- Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
- Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
- Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
- Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).
For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html
NOTE1: I've checked that there is no obvious performance regression. For all the data, see a link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing
NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit was reverted in #29495 because of a correctness issue. Recently, #32374 checked whether Spark could land on v3.1.3, but a new bug was found there. These known issues have been fixed in v3.1.4 by the following PRs:
- janino-compiler/janino#145
- janino-compiler/janino#146
### Why are the changes needed?
janino v3.0.X is no longer maintained.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed.
Closes #32455 from maropu/janino_v3.1.4.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to increase the maximum heap memory setting for release build.
### Why are the changes needed?
When I was cutting RCs for 2.4.8, I frequently encountered OOMs while building with mvn. It happened many times until I increased the heap memory setting.
I am not sure if other release managers encounter the same issue. So I propose to increase the heap memory setting and see if it looks good for others.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Manually used it during cutting RCs of 2.4.8.
Closes #32487 from viirya/release-mvn-oom.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add linter for JavaScript source files.
[ESLint](https://eslint.org/) seems to be a popular linter for JavaScript, so I chose it.
### Why are the changes needed?
A linter enables us to check style and keep the code clean.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually run `dev/lint-js` (Node.js and npm are required).
In this PR, indentation style is also fixed so that the linter passes.
Closes #32274 from sarutak/introduce-eslint.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/32453.
### Why are the changes needed?
Jenkins doesn't check dependency manifest files.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action or manually.
Closes #32458 from dongjoon-hyun/SPARK-35326.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade K8s client to 5.3.1.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
K8s IT is manually tested like the following.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 18 minutes, 33 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 3.959 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 7.830 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 3.457 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 5.496 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 3.239 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 9.006 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 2.422 s]
[INFO] Spark Project Core ................................. SUCCESS [02:17 min]
[INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 23:59 min
[INFO] Finished at: 2021-05-05T11:59:19-07:00
[INFO] ------------------------------------------------------------------------
```
Closes #32443 from dongjoon-hyun/SPARK-35319.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade snappy to version 1.1.8.4.
### Why are the changes needed?
This will bring the latest bug fixes and improvements.
- https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-1183-2021-01-20
- Make pure-java Snappy thread-safe
- Improved SnappyFramedInput/OutputStream performance by using java.util.zip.CRC32C
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32402 from williamhyun/snappy1184.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache commons-lang3 to 3.12.0.
### Why are the changes needed?
This version will bring the latest bug fixes as follows:
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #32393 from LuciferYang/lang3-to-312.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Although the AS-IS master branch already works with K8s 1.20, this PR aims to upgrade K8s client to 5.3.0 to support K8s 1.20 officially.
- https://github.com/fabric8io/kubernetes-client#compatibility-matrix
The following are the notable breaking API changes.
1. Remove Doneable (5.0+):
- https://github.com/fabric8io/kubernetes-client/pull/2571
2. Change Watcher.onClose signature (5.0+):
- https://github.com/fabric8io/kubernetes-client/pull/2616
3. Change Readiness (5.1+)
- https://github.com/fabric8io/kubernetes-client/pull/2796
### Why are the changes needed?
According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
Yes, this is a dev dependency change which affects K8s cluster extension users.
### How was this patch tested?
Pass the CIs.
This is manually tested with K8s IT.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 44 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes #32221 from dongjoon-hyun/SPARK-K8S-530.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
SPARK-10498 added the initial Jira client requirement with 1.0.3 five years ago (January 2016). As of today, it causes `dev/merge_spark_pr.py` to fail with `Python 3.9.4` due to this old dependency. This PR aims to upgrade it to the latest version, 2.0.0. The latest version is also a little old (July 2018).
- https://pypi.org/project/jira/#history
### Why are the changes needed?
`Jira==2.0.0` works well with both Python 3.8/3.9 while `Jira==1.0.3` fails with Python 3.9: the old client uses `async` as a parameter name, which became a reserved keyword in Python 3.7, hence the `SyntaxError` below.
**BEFORE**
```
$ pyenv global 3.9.4
$ pip freeze | grep jira
jira==1.0.3
$ dev/merge_spark_pr.py
Traceback (most recent call last):
File "/Users/dongjoon/APACHE/spark-merge/dev/merge_spark_pr.py", line 39, in <module>
import jira.client
File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/__init__.py", line 5, in <module>
from .config import get_jira
File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/config.py", line 17, in <module>
from .client import JIRA
File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/client.py", line 165
validate=False, get_server_info=True, async=False, logging=True, max_retries=3):
^
SyntaxError: invalid syntax
```
**AFTER**
```
$ pip install jira==2.0.0
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34):
```
### Does this PR introduce _any_ user-facing change?
No. This is a committer-only script.
### How was this patch tested?
Manually.
Closes #32215 from dongjoon-hyun/jira.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable Index unit tests.
Closes #32139 from xinrong-databricks/port.indexes_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable miscellaneous unit tests.
Closes #32152 from xinrong-databricks/port.misc_tests.
Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR bumps up the version of pycodestyle from 2.6.0 to 2.7.0 released a month ago.
### Why are the changes needed?
2.7.0 includes three major fixes below (see https://readthedocs.org/projects/pycodestyle/downloads/pdf/latest/):
- Fix physical checks (such as W191) at end of file. PR #961.
- Add --indent-size option (defaulting to 4). PR #970.
- W605: fix escaped crlf false positive on windows. PR #976
The first and third ones could be useful for developers to detect style issues.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested locally.
Closes #32160 from HyukjinKwon/SPARK-35061.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable internal implementation unit tests.
Closes #32137 from xinrong-databricks/port.test_internal_impl.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to leverage the GitHub Actions resources from the forked repositories instead of using the resources in the ASF organisation at GitHub.
This is how it works:
1. "Build and test" (`build_and_test.yml`) triggers a build on any commit on any branch (except `branch-*.*`), which roughly means:
- The original repository will trigger the build on any commits in `master` branch
- The forked repository will trigger the build on any commit in any branch.
2. The build triggered in the forked repository will checkout the original repository's `master` branch locally, and merge the branch from the forked repository into the original repository's `master` branch locally.
Therefore, the tests in the forked repository will run after being sync'ed with the original repository's `master` branch.
3. In the original repository, a workflow detects the workflow triggered in the forked repository and adds a comment to the PR pointing to the run in the forked repository.
In short, please see this example: HyukjinKwon#34
1. You create a PR and your repository triggers the workflow. Your PR uses the resources allocated to you for testing.
2. Apache Spark repository finds your workflow, and links it in a comment in your PR
**NOTE** that we will still run the tests in the original repository for each commit pushed to `master` branch. This distributes the workflows only in PRs.
### Why are the changes needed?
ASF shares its resources across all ASF projects, which slows development down.
Please see also:
- Discussion in the builds@a.o mailing list: https://lists.apache.org/x/thread.html/r48d079eeff292254db22705c8ef8618f87ff7adc68d56c4e5d0b4105%3Cbuilds.apache.org%3E
- Infra ticket: https://issues.apache.org/jira/browse/INFRA-21646
By distributing the workflows to use each author's resources, we can get around this issue.
### Does this PR introduce _any_ user-facing change?
No, this is a dev-only change.
### How was this patch tested?
Manually tested at https://github.com/HyukjinKwon/spark/pull/34 and https://github.com/HyukjinKwon/spark/pull/33.
Closes #32092 from HyukjinKwon/poc-fork-resources.
Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable plot unit tests.
Closes #32151 from xinrong-databricks/port.plot_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable DataFrame-related unit tests.
Closes #32131 from xinrong-databricks/port.test_dataframe_related.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>