158 commits

Author SHA1 Message Date
Liang-Chi Hsieh 7d13ac177b [SPARK-36393][BUILD] Try to raise memory for GHA
### What changes were proposed in this pull request?

According to the feedback from GitHub, the change causing the memory issue has been rolled back. We can try to raise memory again for GA.

### Why are the changes needed?

Trying higher memory settings for GA. It could speed up the testing time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #33623 from viirya/increasing-mem-ga.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-05 01:31:35 -07:00
Takuya UESHIN 8cb9cf39b6 [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
### What changes were proposed in this pull request?

Disable tests failed by the incompatible behavior of pandas 1.3.

### Why are the changes needed?

Pandas 1.3 has been released.
There are some behavior changes that we should follow, but support for them is not ready yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 14:02:18 +09:00
Hyukjin Kwon c0d1860f25 [SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins
### What changes were proposed in this pull request?

This PR proposes to migrate the coverage report from Jenkins to GitHub Actions by setting up a daily cron job.

### Why are the changes needed?

For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/

Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around it by manually hosting a coverage site via GitHub Pages; see also https://spark-test.github.io/pyspark-coverage-site/, hosted by the spark-test account (which is shared with only a subset of PMC members).

Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used.

### Does this PR introduce _any_ user-facing change?

Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same.

### How was this patch tested?

I manually tested:
- Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484
- Coverage report: 73f0291a7d/python/pyspark
- Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175

Closes #33591 from HyukjinKwon/SPARK-36092.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:37:19 +09:00
Dongjoon Hyun 95a5c1702d Revert "[SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730"
This reverts commit 0e65ed5fb9.
2021-07-30 17:08:50 -07:00
Dongjoon Hyun 0e65ed5fb9 [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark GitHub Action job to use the latest docker image `20210730` having `sklearn` and `mlflow` additionally.
- 5ca94453d1

```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep mlflow
mlflow                    1.19.0

$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 python3.9 -m pip list | grep sklearn
sklearn                   0.0
```

### Why are the changes needed?

This will save the installation time.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action PySpark jobs.

Closes #33595 from dongjoon-hyun/SPARK-36345.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-31 07:20:17 +09:00
itholic abce61f3fd [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:48 -07:00
William Hyun 674202e7b6 [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job
### What changes were proposed in this pull request?
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.

### Why are the changes needed?
This will save GHA resource because MiMa is irrelevant to Python.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GHA.

Closes #33532 from williamhyun/mima.

Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 16:47:59 +09:00
Takuya UESHIN 663cbdfbe5 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.
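
The lookup order described above can be sketched as follows. `lint-python` itself is a shell script, so this is only a Python rendering of the same idea; the function name `resolve_python_executable` is hypothetical, and the `python3` fallback is taken from the description of the previous behavior.

```python
import os


def resolve_python_executable(env=None, default="python3"):
    """Return the interpreter to use for linting.

    Prefers the PYTHON_EXECUTABLE environment variable and falls back to
    `default` when the variable is unset or empty. (Hypothetical helper,
    not part of Spark.)
    """
    env = os.environ if env is None else env
    return env.get("PYTHON_EXECUTABLE") or default


if __name__ == "__main__":
    # In CI the workflow would export PYTHON_EXECUTABLE=python3.9 first.
    print(resolve_python_executable())
```

With this order, a CI job that exports `PYTHON_EXECUTABLE=python3.9` gets the interpreter that has `black` installed, while local runs keep the default.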

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:11 +09:00
Liang-Chi Hsieh c2de111ec5 [SPARK-36270][BUILD][FOLLOWUP] Reduce metaspace size for pyspark
### What changes were proposed in this pull request?

Notice that pyspark GA module `pyspark-pandas-slow` sometimes still has return code 137. Try to reduce its metaspace size further.

### Why are the changes needed?

Fix return code 137 for pyspark GA module.
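
For readers unfamiliar with return code 137: shells report a process killed by a signal as 128 plus the signal number, so 137 corresponds to signal 9 (`SIGKILL` on Linux), which is commonly what an out-of-memory kill looks like. The commit message does not state the cause explicitly, so the OOM reading is an assumption. A minimal sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a short description.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The signal number is not defined on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```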

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

GA

Closes #33496 from viirya/test-ga-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-23 14:20:00 -07:00
Liang-Chi Hsieh fd36ed4550 [SPARK-36270][BUILD] Change memory settings for enabling GA
### What changes were proposed in this pull request?

Trying to adjust build memory settings and serial execution to re-enable GA.

### Why are the changes needed?

GA tests have been failing recently due to return code 137. We need to adjust build settings to make GA work.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

GA

Closes #33447 from viirya/test-ga.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:10:45 +09:00
Hyukjin Kwon 801b369bd0 [SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build
### What changes were proposed in this pull request?

A Scala 2.13 daily job was added, but ideally we should deduplicate it. This PR aims to deduplicate it by creating one more job (`configure-jobs`) that the main job depends on.

`configure-jobs` will properly set the branch, envs, etc. to run the main build properly.

### Why are the changes needed?

To make the maintenance easier

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

See
- https://github.com/HyukjinKwon/spark/actions/runs/1044636792 for a PR
- https://github.com/HyukjinKwon/spark/actions/runs/1048542984 for a cron job

Closes #33410 from HyukjinKwon/SPARK-36204.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 22:21:27 +09:00
Hyukjin Kwon c92790a101 [SPARK-36205][INFRA] Use set-env instead of set-output in GitHub Actions
### What changes were proposed in this pull request?

This PR is mostly a cleanup. It removes the unused `sync-branch` id in some steps, and uses `set-env` instead of `set-output` to set an env.
This can be backported to branch-3.2 too.

### Why are the changes needed?

Cleanup.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #33412 from HyukjinKwon/minor-cleanup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:43:19 +09:00
William Hyun c336f73ccd [SPARK-36198][TESTS] Skip UNIDOC generation in PySpark GHA job
### What changes were proposed in this pull request?
This PR aims to skip UNIDOC generation in PySpark GHA job.

### Why are the changes needed?

PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total.
- https://github.com/apache/spark/runs/3098268973?check_suite_focus=true
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc
...
[info] Main Java API documentation successful.
[success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the GHA.

Closes #33407 from williamhyun/SKIP_UNIDOC.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 17:52:28 -07:00
Dongjoon Hyun 3218e4e14b [SPARK-36152][INFRA][TESTS] Add Scala 2.13 daily build and test GitHub Action job
### What changes were proposed in this pull request?

This PR aims to add a new GitHub Action daily workflow for Scala 2.13 build and test.

### Why are the changes needed?

Apache Spark 3.2.0 aims to support Scala 2.13 officially. We need a test coverage for master/3.2.

The following is the test result on my repository. The daily schedule triggered correctly for both master/3.2 branches.

- https://github.com/dongjoon-hyun/spark/actions/runs/1036083268
![Screen Shot 2021-07-15 at 10 09 22 PM](https://user-images.githubusercontent.com/9700541/125894950-a5a2ff9c-48f9-4184-913c-5422da5305a3.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a daily job. Since there is no way to see this in the PR itself, I tested it in my repository first, as described above.

Closes #33358 from dongjoon-hyun/SPARK-36152.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-16 07:59:39 -07:00
Dongjoon Hyun d69f981869 [SPARK-36165][INFRA] Fix SQL doc generation in GitHub Action
### What changes were proposed in this pull request?

This PR aims to fix SQL doc generation in GitHub Action by specifying the mkdocs-installed python version explicitly.

### Why are the changes needed?

Currently, the SQL doc generation uses `spark-submit`, which picks up a different `Python 3` binary.
```
Generating SQL configuration table HTML file.
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-config-docs.py", line 25, in <module>
    from mkdocs.structure.pages import markdown
ModuleNotFoundError: No module named 'mkdocs'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action linter job.

Closes #33372 from dongjoon-hyun/fix_mkdocs.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 11:41:48 -07:00
Hyukjin Kwon a71dd6af2f [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
### What changes were proposed in this pull request?

This PR proposes to use Python 3.9 in documentation and linter at GitHub Actions. This PR also contains the fixes for mypy check (introduced by Python 3.9 upgrade)

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

### Why are the changes needed?

Python 3.6 is deprecated at SPARK-35938

### Does this PR introduce _any_ user-facing change?

No. It may affect static analysis through some type hints, but those changes are non-breaking.

### How was this patch tested?

I manually checked by GitHub Actions build in forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 08:01:54 -07:00
Dongjoon Hyun 416a7fd490 [SPARK-36139][INFRA][TESTS] Remove Python 3.6 from pyspark GitHub Action job
### What changes were proposed in this pull request?

This PR aims to remove Python 3.6 installation from `pyspark` job in `build and test` GitHub Action Workflow for Apache Spark 3.3.

### Why are the changes needed?

Python 3.6 is deprecated via SPARK-35938. This will save the GitHub Action resource by removing python3.6 testing.

**BEFORE**
```
Will test against the following Python executables: ['python3.6', 'python3.9', 'pypy3']
```

**AFTER**
```
 Will test against the following Python executables: ['python3.9', 'pypy3']
```

Note that Python 3.6 is still used in the following cases.
- In other jobs like `Linter`
- In the `dev/run-pip-tests` script, for pip packaging testing via `conda`.
  - This is handled via https://github.com/apache/spark/pull/33351

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33349 from dongjoon-hyun/SPARK-36139.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-14 21:01:25 -07:00
Karen Feng 71c086eb87 [SPARK-35958][CORE] Refactor SparkError.scala to SparkThrowable.java
### What changes were proposed in this pull request?

Refactors the base Throwable trait `SparkError.scala` (introduced in SPARK-34920) into an interface `SparkThrowable.java`.

### Why are the changes needed?

- Renaming `SparkError` to `SparkThrowable` better reflects that this is the base interface for both `Exception` and `Error`
- Migrating to Java maximizes its extensibility

### Does this PR introduce _any_ user-facing change?

Yes; the base trait has been renamed and the accessor methods have changed (eg. `sqlState` -> `getSqlState()`).

### How was this patch tested?

Unit tests.

Closes #33164 from karenfeng/SPARK-35958.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-08 23:54:53 +08:00
Hyukjin Kwon 16c195ccfb [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to bump up the mypy version to 0.910 which is the latest.

### Why are the changes needed?

To catch the type hint mistakes better in PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GitHub Actions should test it out.

Closes #33223 from HyukjinKwon/SPARK-35684.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:26:28 +09:00
Kevin Su 11fcbc73cb [SPARK-36007][INFRA] Failed to run benchmark in GA
### What changes were proposed in this pull request?

When running the benchmark in GA, I hit the error below.

https://github.com/pingsutw/spark/runs/2867617238?check_suite_focus=true
```
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
21/06/20 07:40:02 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: spark.test.home is not set!
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:148)
    at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:954)
    at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2(LocalSparkCluster.scala:68)
    at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2$adapted(LocalSparkCluster.scala:65)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at org.apache.spark.deploy.LocalSparkCluster.start(LocalSparkCluster.scala:65)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2954)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:559)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:137)
    at org.apache.spark.serializer.KryoSerializerBenchmark$.createSparkContext(KryoSerializerBenchmark.scala:86)
    at org.apache.spark.serializer.KryoSerializerBenchmark$.sc$lzycompute$1(KryoSerializerBenchmark.scala:58)
    at org.apache.spark.serializer.KryoSerializerBenchmark$.sc$1(KryoSerializerBenchmark.scala:58)
    at org.apache.spark.serializer.KryoSerializerBenchmark$.$anonfun$run$3(KryoSerializerBenchmark.scala:63)
```

### Why are the changes needed?

Set `spark.test.home` in the benchmark workflow.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Rerun the benchmark in my fork.
https://github.com/pingsutw/spark/actions/runs/996067851

Closes #33203 from pingsutw/SPARK-36007.

Lead-authored-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: Kevin Su <pingsutw@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-05 09:17:06 +09:00
Dongjoon Hyun dcc405743e [SPARK-35994][INFRA] Publish snapshot from branch-3.2
### What changes were proposed in this pull request?

This PR aims to publish snapshot artifacts from branch-3.2 additionally.

### Why are the changes needed?

`GitHub Action`'s cronjob feature is only supported in the default branch. So, to have a daily job for branch-3.2, we should add it here.

Currently, it's publishing master and 3.1.
- https://github.com/apache/spark/actions/workflows/publish_snapshot.yml

<img width="273" alt="Screen Shot 2021-07-02 at 10 22 41 AM" src="https://user-images.githubusercontent.com/9700541/124309380-7c407400-db1f-11eb-9aa4-30db61a72b80.png">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33192 from dongjoon-hyun/SPARK-35994.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 10:54:30 -07:00
William Hyun a6088e5036 [SPARK-35924][BUILD][TESTS] Add Java 17 ea build test to GitHub action
### What changes were proposed in this pull request?
This PR aims to add Java 17-ea build test to GitHub action.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass newly added Java 17-ea GitHub action job.

Closes #33126 from williamhyun/SPARK-35924.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 11:19:38 -07:00
Hyukjin Kwon 2d47fb7683 [SPARK-35755][PYTHON][INFRA] Use higher PyArrow versions in GitHub Actions build
### What changes were proposed in this pull request?

This PR proposes to use higher versions of PyArrow which more users use in general.

Without this PR, the testing matrix is as follows:

- (Python 3.8) Use PyArrow **2.x** in [pandas UDF tests in SQL side](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala)
- (Python 3.6) Use PyArrow **2.x** in PySpark tests
- (Python 3.9) Use PyArrow 4.x in PySpark tests (no change)
- (Python 3.6) Use PyArrow **2.x** in PySpark documentation generation (it runs Spark jobs to generate images to use in PySpark API docs)

After this PR, the testing matrix is as follows:

- (Python 3.8) Use PyArrow **4.x** in [pandas UDF tests in SQL side](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala)
- (Python 3.6) Use PyArrow **3.x** in PySpark tests
- (Python 3.9) Use PyArrow 4.x in PySpark tests (no change)
- (Python 3.6) Use PyArrow **4.x** in PySpark documentation generation (it runs Spark jobs to generate images to use in PySpark API docs)

### Why are the changes needed?

To test the matrix that more people use.

### Does this PR introduce _any_ user-facing change?

No, dev and testing only.

### How was this patch tested?

GitHub Actions in this PR should test it out.

Closes #32906 from HyukjinKwon/SPARK-35755.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 09:59:38 +09:00
Gengliang Wang 62be22e929 [SPARK-35694][INFRA][FOLLOWUP] Increase the default JVM stack size of SBT/Maven
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/32838, we set the default JVM stack size to 16M from 4M.
However, there are still stack overflow errors in builds:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139672/console

Let's update the value to 64M

### Why are the changes needed?

Make test build stable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual trigger test builds.

Closes #32879 from gengliangwang/increaseStackAgain.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-11 18:51:07 +09:00
Gengliang Wang 0b5683a4d5 [SPARK-35694][INFRA] Increase the default JVM stack size of SBT/Maven
### What changes were proposed in this pull request?

The Jenkins SBT/Maven build keeps failing with a stack overflow error:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139542

We should increase the JVM stack size to 16MB.
Also, https://github.com/apache/spark/pull/32521 set the stack size to 256MB for the Java 11 build, which might be too big since every thread will allocate this memory for its stack. This PR also sets it to 16MB to make the config consistent.
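
The per-thread cost mentioned above is simple multiplication, and a quick calculation shows why 256MB looks excessive. The sketch below is an upper-bound estimate only: the thread count is a made-up figure, the helper name is hypothetical, and a JVM typically reserves stack address space without committing all of it up front.

```python
MIB = 1024 * 1024


def reserved_stack_bytes(thread_count: int, stack_size_mib: int) -> int:
    """Upper bound on stack memory reserved by `thread_count` threads."""
    return thread_count * stack_size_mib * MIB


if __name__ == "__main__":
    threads = 200  # hypothetical number of live threads in a test JVM
    # 4, 16, 64 and 256 MB are the stack sizes mentioned in these commits.
    for size_mib in (4, 16, 64, 256):
        total_mib = reserved_stack_bytes(threads, size_mib) // MIB
        print(f"-Xss{size_mib}m with {threads} threads: up to {total_mib} MiB")
```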

### Why are the changes needed?

Fix SBT/Maven build.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Jenkins and GA tests.

Closes #32838 from gengliangwang/increaseSBTStackSize.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-09 19:36:29 +08:00
Hyukjin Kwon 3be7b29cd8 Revert "[SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow"
This reverts commit f3dc549d9c.
2021-06-09 16:48:29 +09:00
Hyukjin Kwon d12c1472f6 [SPARK-35682][PYTHON][TESTS] Pin mypy==0.812 in GitHub Actions CI
### What changes were proposed in this pull request?

Seems like the new MyPy version was released (0.901) and it broke the CI: https://github.com/python/mypy/releases.

```
python/pyspark/pandas/indexes/base.py:2007: error: Argument 1 to "from_tuples" of "MultiIndex" has incompatible type "Index"; expected "List[Tuple[Any, ...]]"
python/pyspark/testing/pandasutils.py:41: error: Library stubs not installed for "tabulate" (or incompatible with Python 3.6)
python/pyspark/testing/pandasutils.py:41: note: Hint: "python3 -m pip install types-tabulate"
python/pyspark/testing/pandasutils.py:41: note: (or run "mypy --install-types" to install all missing stub packages)
python/pyspark/testing/pandasutils.py:41: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
Found 2 errors in 2 files (checked 312 source files)
```

I tried to fix these instances and pin it to the latest version (0.901). However, I realised that `python/pyspark/pandas/indexes/base.py:2007` has a logic issue (see https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007) which cannot be fixed quickly.

Therefore, I decided to pin it to the previous version we used before for now, in order to unblock other PRs builds.

### Why are the changes needed?

To unblock other PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I tested locally, but it still has to be tested and pass in GitHub Actions in this PR.

Closes #32829 from HyukjinKwon/SPARK-35682.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-09 11:05:16 +09:00
Yikun Jiang f3dc549d9c [SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow
### What changes were proposed in this pull request?

This patch uses the "concurrency" syntax to replace the "cancel job" workflow:
- .github/workflows/benchmark.yml
- .github/workflows/labeler.yml
- .github/workflows/notify_test_workflow.yml
- .github/workflows/test_report.yml

Remove the .github/workflows/cancel_duplicate_workflow_runs.yml

Note that the push/schedule-based jobs are not changed, to keep the same config as in a4b70758d3:
- .github/workflows/build_and_test.yml
- .github/workflows/publish_snapshot.yml
- .github/workflows/stale.yml
- .github/workflows/update_build_status.yml

### Why are the changes needed?
We are using the [cancel_duplicate_workflow_runs](a70e66ecfa/.github/workflows/cancel_duplicate_workflow_runs.yml (L1)) job to cancel previous jobs when a new job is queued. GitHub Actions now supports this natively through the ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency) syntax, which ensures that only a single job or workflow using the same concurrency group runs at a time.
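
The effect of a concurrency group can be illustrated with a small simulation. This is not GitHub's implementation; it only models the behavior assumed here, namely that a newly queued run cancels the earlier run in the same group (as with `cancel-in-progress: true`). The group names and run ids are invented for the example.

```python
from typing import Dict, List, Tuple


def schedule_runs(queued: List[Tuple[str, int]]) -> Dict[int, str]:
    """Return the final state of each run id.

    `queued` lists (concurrency_group, run_id) pairs in the order they were
    queued. A later run in a group cancels the earlier one in that group.
    """
    latest_in_group: Dict[str, int] = {}
    state: Dict[int, str] = {}
    for group, run_id in queued:
        previous = latest_in_group.get(group)
        if previous is not None:
            state[previous] = "cancelled"
        latest_in_group[group] = run_id
        state[run_id] = "active"
    return state


if __name__ == "__main__":
    # Two pushes to the same (hypothetical) PR group, one push to another.
    print(schedule_runs([("pr-1", 101), ("pr-2", 102), ("pr-1", 103)]))
```

Keying the group on something like the workflow name plus the PR ref keeps unrelated PRs from cancelling each other.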

Related: https://github.com/apache/arrow/pull/10416 and https://github.com/potiuk/cancel-workflow-runs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Triggered the PR manually.

Closes #32806 from Yikun/SPARK-X.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 12:10:40 +09:00
itholic b8740a1d1e [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes
### What changes were proposed in this pull request?

This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.

By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules.

### Why are the changes needed?

This can reduce the cost of static analysis during development.

It has been used continuously for about a year in the Koalas project and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-06 17:30:07 -07:00
Takuya UESHIN 221553c204 [SPARK-35642][INFRA] Split pyspark-pandas tests to rebalance the test duration
### What changes were proposed in this pull request?

Splits some tests in the `pyspark-pandas` module out as slow tests to rebalance the test duration.

Picked the top 12 tests from the previous runs and the total times are almost even.
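
The selection described above amounts to ranking tests by duration and moving the longest ones into a separate module. A minimal sketch, assuming per-test durations are available from earlier runs; the test names, timings, and the `split_slow_tests` helper are all invented for illustration, and the real split used the top 12 tests rather than the top 2 shown here.

```python
from typing import Dict, List, Tuple


def split_slow_tests(
    durations: Dict[str, float], top_n: int
) -> Tuple[List[str], List[str]]:
    """Split tests into (regular, slow), where slow is the `top_n` longest."""
    ranked = sorted(durations, key=lambda name: durations[name], reverse=True)
    return ranked[top_n:], ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical durations in seconds from a previous run.
    durations = {
        "test_frame": 300.0, "test_series": 280.0, "test_groupby": 120.0,
        "test_indexing": 110.0, "test_ops": 100.0, "test_namespace": 90.0,
        "test_config": 80.0, "test_sql": 80.0,
    }
    regular, slow = split_slow_tests(durations, top_n=2)
    print("regular total:", sum(durations[t] for t in regular))
    print("slow total:", sum(durations[t] for t in slow))
```

Choosing `top_n` so that the two totals come out roughly equal is what keeps the parallel CI jobs finishing at about the same time.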

### Why are the changes needed?

Currently the `pyspark-pandas` module tests take a long time, so we should rebalance the tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32778 from ueshin/issues/SPARK-35642/split-pandas-on-spark-tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 12:52:52 +09:00
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation
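
To make the renames above concrete, here is a minimal sketch that applies the first three to a snippet of Koalas-style example code. The regular expressions, the `port_snippet` helper, and the sample snippet are assumptions for illustration, not the tooling used in this PR.

```python
import re

# The renames listed above, written as (pattern, replacement) pairs.
# Order matters: the import rename must run before the accessor rename,
# otherwise "databricks.koalas" would be caught by the ".koalas" pattern.
RENAMES = [
    (re.compile(r"\bdatabricks\.koalas\b"), "pyspark.pandas"),
    (re.compile(r"\bto_koalas\b"), "to_pandas_on_spark"),
    (re.compile(r"\.koalas\b"), ".pandas_on_spark"),
]


def port_snippet(source: str) -> str:
    """Apply each rename in order to a snippet of Koalas example code."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source


# Hypothetical documentation snippet written against Koalas.
before = (
    "import databricks.koalas as ks\n"
    "kdf = sdf.to_koalas()\n"
    "kdf.koalas.apply_batch(func)\n"
)
print(port_snippet(before))
```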

Other than that,

- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installation

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentation.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00
Xinrong Meng 7eeb07d0f9 [SPARK-35606][PYTHON][INFRA] List Python 3.9 installed libraries in build_and_test workflow
### What changes were proposed in this pull request?

In the build_and_test workflow, tests are run against both Python 3.6 and Python 3.9. However, only libraries installed in Python 3.6 are listed. We should list Python 3.9's installed libraries as well.

### Why are the changes needed?

Listing Python 3.9's installed libraries is helpful for debugging.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #32737 from xinrong-databricks/ci_py3.9lib.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-03 14:24:53 -07:00
Hyukjin Kwon d478cff8bb [SPARK-35620][BUILD][PYTHON] Remove documentation build in Python linter
### What changes were proposed in this pull request?

This PR proposes to remove the PySpark documentation build from the linter check:

- to speed up CI build by removing duplicate documentation build (linter and doc build)
- for https://github.com/apache/spark/pull/32726. With that PR, the PySpark documentation build requires a full Spark build to generate plot images in PySpark documentation. It makes less sense to require it in the Python linter.
- to remove unnecessary dependency installation for Python linter in CI

### Why are the changes needed?

The Python linter script includes a documentation build. Because of this, CI runs the documentation build twice, requires unnecessary dependencies to be installed, and takes extra time. It makes more sense to exclude the documentation build from the Python linter.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested, and it will be tested in CI.

Closes #32760 from HyukjinKwon/SPARK-35620.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-03 12:48:30 +09:00
Dongjoon Hyun 2550490c09 [SPARK-35617][INFRA] Update GitHub Action docker image to 20210602
### What changes were proposed in this pull request?

This PR aims to update GitHub Action docker image with the following updates.
1. Add `pip` explicitly to Python 3.8/3.9
2. Add `plotly` to Python 3.8.
3. Since SPARK-35573 fixes SparkR UT failures on R 4.1.0, update SparkR job to run R 4.1.0.

### Why are the changes needed?

To improve the GitHub Action test infra and unblock #32737

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32755 from dongjoon-hyun/SPARK-35617.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-02 18:30:38 -07:00
Hyukjin Kwon 14e12c64d3 [SPARK-35575][INFRA] Recover updating build status in GitHub Actions
### What changes were proposed in this pull request?

This PR fixes the logic to be fault tolerant when it gets the status of the workflow run from PR author's forked repository.

Looks like https://github.com/apache/spark/pull/32483 removed and disabled (see also https://github.com/apache/spark/pull/32486/checks?check_run_id=2648696751) the GitHub actions workflow runs in the forked repositories, and the detection logic in the main repo fails because the runs don't exist anymore.

See also https://github.com/apache/spark/runs/2709537998?check_suite_focus=true

### Why are the changes needed?

To recover the status update of GitHub Actions in PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It cannot be tested without being merged.

Closes #32711 from HyukjinKwon/SPARK-35575.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-31 19:29:54 +09:00
Dongjoon Hyun c225196be0 [SPARK-35507][INFRA] Add Python 3.9 in the docker image for GitHub Action
### What changes were proposed in this pull request?

This PR aims to add `Python 3.9.5` and update the docker image references, except for the SparkR job.

### Why are the changes needed?

To save GitHub Action resources and be more robust to Python and R library changes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32706 from dongjoon-hyun/SPARK-35507.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-31 05:56:47 +00:00
Kousuke Saruta 2de19e460b [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
#32631 was merged once, but it overlooked a consideration.

Diff between this change and 692d95d145 merged in #32631 is as follows.

```
       if: github.repository != 'apache/spark'
       id: sync-branch
       run: |
+        apache_spark_ref=`git rev-parse HEAD`
         git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' commit -m "Merged commit"
+        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
     - name: Cache Scala, SBT and Maven
       uses: actions/cache@v2
       with:
```

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32691 from sarutak/docker-integration-test-ga-take2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 16:54:47 +09:00
Hyukjin Kwon d189cf75f9 Revert "[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA"
This reverts commit 0a74ad66b3.
2021-05-28 14:29:12 +09:00
Kousuke Saruta 0a74ad66b3 [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
`docker-integration-tests` can't run if docker is not installed, so it runs only if `docker-integration-tests` is specified with `--module`.
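
The opt-in behaviour described here can be sketched as follows. This is a simplified illustration, not the actual `run-tests.py` logic: `ALL_MODULES`, `OPT_IN_MODULES`, and `select_modules` are hypothetical names, and only the module name `docker-integration-tests` comes from the description.

```python
from typing import List, Optional

# Hypothetical module list; only "docker-integration-tests" comes from the PR text.
ALL_MODULES = ["core", "sql", "docker-integration-tests"]

# Modules that need external tooling (here Docker) and are therefore opt-in.
OPT_IN_MODULES = {"docker-integration-tests"}


def select_modules(requested: Optional[List[str]]) -> List[str]:
    """Return the modules to test.

    With no explicit request, opt-in modules are left out; they run only
    when they are named explicitly by the caller.
    """
    if requested is None:
        return [m for m in ALL_MODULES if m not in OPT_IN_MODULES]
    unknown = [m for m in requested if m not in ALL_MODULES]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    return list(requested)


print(select_modules(None))
print(select_modules(["docker-integration-tests"]))
```

Keeping the Docker-dependent module out of the default set means a plain test run still works on machines without Docker.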

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32631 from sarutak/docker-integration-test-ga.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 07:56:37 +09:00
Hyukjin Kwon e47e615c0e [SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions
### What changes were proposed in this pull request?

This PR enables GitHub Actions to test PySpark with Python 3.9.

### Why are the changes needed?

To verify the support of Python 3.9.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Existing tests should cover.

Closes #32657 from HyukjinKwon/SPARK-35506.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 09:25:51 +09:00
Hyukjin Kwon 4a6d844184 [SPARK-35497][PYTHON] Enable plotly tests in pandas-on-Spark
### What changes were proposed in this pull request?

This PR enables plot tests with plotly

```bash
./python/run-tests --python-executables=python3 --modules=pyspark-pandas
```

**Before**:

```
Traceback (most recent call last):
  File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module>
    plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly "
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

```

**After**:

```
...
Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly
...
Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s)
...
Tests passed in 1296 seconds
```
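
The `TypeError` in the "Before" trace is what Python raises when `None` is concatenated with a string. The sketch below reproduces that error and shows one possible guard. `skip_reason` is a hypothetical helper, the suffix is copied from the visible part of the trace, and the guard is an assumption rather than the fix applied in this PR.

```python
from typing import Optional


def skip_reason(requirement_message: Optional[str]) -> str:
    """Build a skip message, tolerating a missing requirement message."""
    suffix = " Or pandas<1.0; pandas<1.0 does not support latest plotly"
    # Falling back to "" avoids `None + str`, which raises TypeError.
    return (requirement_message or "") + suffix


# Reproduce the error from the trace: None cannot be concatenated with str.
try:
    None + " Or pandas<1.0"  # type: ignore[operator]
except TypeError as exc:
    print(f"TypeError: {exc}")

print(skip_reason(None))
print(skip_reason("plotly is not installed."))
```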

### Why are the changes needed?

For test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

By running the tests.

Closes #32649 from HyukjinKwon/SPARK-35497.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-25 12:31:32 +09:00
Takuya UESHIN c06480519e [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows
### What changes were proposed in this pull request?

Follows the checkout-merge way to use the latest commit for the linter and other workflows.

### Why are the changes needed?

For the linter and other workflows besides build-and-tests, we should follow the checkout-merge way to use the latest commit; otherwise, those workflows could run with old settings.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Existing tests.

Closes #32597 from ueshin/issues/SPARK-35450/infra.

Lead-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 10:07:28 +09:00
Takeshi Yamamuro 2390b9dbcb [SPARK-35413][INFRA] Use the SHA of the latest commit when checking out databricks/tpcds-kit
### What changes were proposed in this pull request?

This PR proposes to use the SHA of the latest commit ([2a5078a782192ddb6efbcead8de9973d6ab4f069](2a5078a782)) when checking out `databricks/tpcds-kit`. This can prevent the test workflow from breaking accidentally if the repository changes drastically.

### Why are the changes needed?

For better test workflow.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GA passed.

Closes #32561 from maropu/UseRefInCheckout.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-17 09:26:04 +09:00
Hyukjin Kwon 7d371d27f2 [SPARK-35393][PYTHON][INFRA][TESTS] Recover pip packaging test in Github Actions
### What changes were proposed in this pull request?

Currently pip packaging test is being skipped:

```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Missing virtualenv & conda, skipping pip installability tests
Cleaning up temporary directory - /tmp/tmp.iILYWISPXW
```

See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true

The GitHub Actions image has its default Conda installed at `/usr/share/miniconda`, but it seems the image we're using for PySpark does not have it (which is legitimate).

This PR proposes to install Conda to use in pip packaging tests in GitHub Actions.

### Why are the changes needed?

To recover the test coverage.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It was tested in my fork: https://github.com/HyukjinKwon/spark/runs/2575126882?check_suite_focus=true

```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Using conda virtual environments
Testing pip installation with python 3.6
Using /tmp/tmp.qPjTenqfGn for virtualenv
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /tmp/tmp.qPjTenqfGn/3.6

  added / updated specs:
    - numpy
    - pandas
    - pip
    - python=3.6
    - setuptools

...

Successfully ran pip sanity check
```

Closes #32537 from HyukjinKwon/SPARK-35393.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-13 10:35:56 -07:00
Takuya UESHIN c0b52da89e [SPARK-35388][INFRA] Allow the PR source branch to include slashes
### What changes were proposed in this pull request?

This PR allows the PR source branch to include slashes.

### Why are the changes needed?

There are PRs whose source branches include slashes, like `issues/SPARK-35119/gha` here or #32523.

Before the fix, the PR build fails in the `Sync the current branch with the latest in Apache Spark` phase.
For example, at #32523, the source branch is `issues/SPARK-35382/nested_higher_order_functions`:

```
...
fatal: couldn't find remote ref nested_higher_order_functions
Error: Process completed with exit code 128.
```

(https://github.com/ueshin/apache-spark/runs/2569356241)
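
The error message suggests that only the last path segment of the ref was being used as the branch name. The sketch below contrasts that lossy approach with stripping just the `refs/heads/` prefix. Both helper functions are hypothetical, and this is an assumption about the failure mode rather than the actual workflow code.

```python
def branch_from_ref(ref: str) -> str:
    """Strip only the leading 'refs/heads/' so slashes in the branch name survive."""
    prefix = "refs/heads/"
    return ref[len(prefix):] if ref.startswith(prefix) else ref


def last_segment(ref: str) -> str:
    """Lossy alternative: keep only the text after the final slash."""
    return ref.rsplit("/", 1)[-1]


ref = "refs/heads/issues/SPARK-35382/nested_higher_order_functions"
print(last_segment(ref))     # nested_higher_order_functions
print(branch_from_ref(ref))  # issues/SPARK-35382/nested_higher_order_functions
```

For a branch without slashes the two functions agree, which is why the problem only shows up for branch names like the one above.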

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

This PR's source branch includes slashes and the source branch of #32525 doesn't.

Closes #32524 from ueshin/issues/SPARK-35119/gha.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-13 10:59:30 +09:00
Gengliang Wang dac6f175a6 [SPARK-35387][INFRA] Increase the JVM stack size for Java 11 build test
### What changes were proposed in this pull request?

After merging https://github.com/apache/spark/pull/32439, there is a flaky error from the Github action job "Java 11 build with Maven":

```
Error:  ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
```
We can resolve it by increasing the stack size of JVM to 256M. The container for Github action jobs has 7G memory so this should be fine.

### Why are the changes needed?

Fix flaky test failure in Java 11 build test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Github action test

Closes #32521 from gengliangwang/increaseStackSize.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-12 10:49:09 -07:00
Kousuke Saruta 7e3446a204 [SPARK-35377][INFRA] Add JS linter to GA
### What changes were proposed in this pull request?

SPARK-35175 (#32274) added a linter for JS so let's add it to GA.

### Why are the changes needed?

To keep JS code clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA

Closes #32512 from sarutak/ga-lintjs.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 16:00:55 +09:00
Kousuke Saruta af0d99cce6 [SPARK-35375][INFRA] Use Jinja2 < 3.0.0 for Python linter dependency in GA
### What changes were proposed in this pull request?

The Python linter has been failing in GA for the past few hours.
The latest release, Jinja 3.0.0, seems to cause this failure.
https://pypi.org/project/Jinja2/

```
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
    {% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2

re-running make html to print full warning list:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
    {% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
```

### Why are the changes needed?

To recover GA build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #32509 from sarutak/fix-python-lint-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 10:13:38 +09:00
Takeshi Yamamuro e834ef74dc [SPARK-35293][SQL][TESTS][FOLLOWUP] Update the hash key to refresh TPC-DS cache data in forked GA jobs
### What changes were proposed in this pull request?

This is a follow-up PR of #32420 and it intends to update the hash key to refresh TPC-DS cache data in forked GA jobs.

### Why are the changes needed?

To recover GA jobs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32460 from maropu/SPARK-35293-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 16:06:50 -07:00
Takeshi Yamamuro 5c67d0c8f7 [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite
### What changes were proposed in this pull request?

This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.

### Why are the changes needed?

For better testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32420 from maropu/UseTpcdsKit.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-06 15:25:46 +09:00