Commit graph

134 commits

Author SHA1 Message Date
Gengliang Wang 0b5683a4d5 [SPARK-35694][INFRA] Increase the default JVM stack size of SBT/Maven
### What changes were proposed in this pull request?

The jenkins SBT/Maven build keep failing with stack overflow error:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139542

We should increase the JVM stack size to 16MB.
Also, https://github.com/apache/spark/pull/32521 set the stack size to 256MB for Java 11 build, which might be too big since every thread will allocate this memory for the stack. This PR also set it as 16MB to make the config consistent.

### Why are the changes needed?

Fix SBT/Maven build.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Jenkins and GA tests.

Closes #32838 from gengliangwang/increaseSBTStackSize.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-09 19:36:29 +08:00
Hyukjin Kwon 3be7b29cd8 Revert "[SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow"
This reverts commit f3dc549d9c.
2021-06-09 16:48:29 +09:00
Hyukjin Kwon d12c1472f6 [SPARK-35682][PYTHON][TESTS] Pin mypy==0.812 in GitHub Actions CI
### What changes were proposed in this pull request?

Seems like the new MyPy version was released (0.901) and it broke the CI: https://github.com/python/mypy/releases.

```
python/pyspark/pandas/indexes/base.py:2007: error: Argument 1 to "from_tuples" of "MultiIndex" has incompatible type "Index"; expected "List[Tuple[Any, ...]]"
python/pyspark/testing/pandasutils.py:41: error: Library stubs not installed for "tabulate" (or incompatible with Python 3.6)
python/pyspark/testing/pandasutils.py:41: note: Hint: "python3 -m pip install types-tabulate"
python/pyspark/testing/pandasutils.py:41: note: (or run "mypy --install-types" to install all missing stub packages)
python/pyspark/testing/pandasutils.py:41: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
Found 2 errors in 2 files (checked 312 source files)
```

I tried to fix these instances and pin it to the latest version (0.901). However, I realised that `python/pyspark/pandas/indexes/base.py:2007` has a logic issue (see https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007) which cannot be fixed quickly.

Therefore, I decided to pin it to the previous version we used before for now, in order to unblock other PRs builds.

### Why are the changes needed?

To unblock other PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I tested in my local but it has to be tested and passed in GitHub Actions in this PR.

Closes #32829 from HyukjinKwon/SPARK-35682.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-09 11:05:16 +09:00
Yikun Jiang f3dc549d9c [SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow
### What changes were proposed in this pull request?

This patch uses the "concurrency" syntax to replace the "cancel job" workflow:
- .github/workflows/benchmark.yml
- .github/workflows/labeler.yml
- .github/workflows/notify_test_workflow.yml
- .github/workflows/test_report.yml

Remove the .github/workflows/cancel_duplicate_workflow_runs.yml

Note that the push/schedule based job are not changed to keep the same config in a4b70758d3:
- .github/workflows/build_and_test.yml
- .github/workflows/publish_snapshot.yml
- .github/workflows/stale.yml
- .github/workflows/update_build_status.yml

### Why are the changes needed?
We are using [cancel_duplicate_workflow_runs](a70e66ecfa/.github/workflows/cancel_duplicate_workflow_runs.yml (L1)) job to cancel previous jobs when a new job is queued. Now, it has been supported by the github action by using ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency) syntax to make sure only a single job or workflow using the same concurrency group.

Related: https://github.com/apache/arrow/pull/10416 and https://github.com/potiuk/cancel-workflow-runs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
triger the PR manaully

Closes #32806 from Yikun/SPARK-X.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 12:10:40 +09:00
itholic b8740a1d1e [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes
### What changes were proposed in this pull request?

This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.

By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules.

### Why are the changes needed?

This can be reduces the cost of static analysis during development.

It has been used continuously for about a year in the Koalas project and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformat the pandas API on Spark codes by running the `./dev/reformat-python`, and checked the `./dev/lint-python` is passed.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-06 17:30:07 -07:00
Takuya UESHIN 221553c204 [SPARK-35642][INFRA] Split pyspark-pandas tests to rebalance the test duration
### What changes were proposed in this pull request?

Splits some tests in `pyspark-pandas` module as slot tests to rebalance the test duration.

Picked the top 12 tests from the previous runs and the total times are almost even.

### Why are the changes needed?

Currently `pyspark-pandas` module tests take long time, so we should rebalance the tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32778 from ueshin/issues/SPARK-35642/split-pandas-on-spark-tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 12:52:52 +09:00
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation

Other then that,

- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installataion

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentations.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00
Xinrong Meng 7eeb07d0f9 [SPARK-35606][PYTHON][INFRA] List Python 3.9 installed libraries in build_and_test workflow
### What changes were proposed in this pull request?

In the build_and_test workflow, tests are run against both Python 3.6 and Python 3.9. However, only libraries installed in Python 3.6 are listed. We should list Python 3.9's installed libraries as well.

### Why are the changes needed?

Listing Python 3.9's installed libraries is helpful for debugging.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #32737 from xinrong-databricks/ci_py3.9lib.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-03 14:24:53 -07:00
Hyukjin Kwon d478cff8bb [SPARK-35620][BUILD][PYTHON] Remove documentation build in Python linter
### What changes were proposed in this pull request?

This PR proposes to remove PySpark documentation build in linter check because:

- to speed up CI build by removing duplicate documentation build (linter and doc build)
- for https://github.com/apache/spark/pull/32726. With this PR PySpark documentation build requires a full Spark build to generate plot images in PySpark documentation. It makes less sense to require it in Python linter.
- to remove unnecessary dependency installation for Python linter in CI

### Why are the changes needed?

Python linter script includes documentation build. Because of this, we run documentation builds duplicately in CI, and requires unnecessary dependencies to be installed, and takes extra time. It would more make sense to exclude this in Python linter.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested, and it will be tested in CI.

Closes #32760 from HyukjinKwon/SPARK-35620.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-03 12:48:30 +09:00
Dongjoon Hyun 2550490c09 [SPARK-35617][INFRA] Update GitHub Action docker image to 20210602
### What changes were proposed in this pull request?

This PR aims to update GitHub Action docker image with the following updates.
1. Add `pip` explicitly to Python 3.8/3.9
2. Add `plotly` to Python 3.8.
3. Since SPARK-35573 fixes SparkR UT failures on R 4.1.0, update SparkR job to run R 4.1.0.

### Why are the changes needed?

To improve the GitHub Action test infra and unblock #32737

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32755 from dongjoon-hyun/SPARK-35617.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-02 18:30:38 -07:00
Hyukjin Kwon 14e12c64d3 [SPARK-35575][INFRA] Recover updating build status in GitHub Actions
### What changes were proposed in this pull request?

This PR fixes the logic to be fault tolerant when it gets the status of the workflow run from PR author's forked repository.

Looks like https://github.com/apache/spark/pull/32483 removed and disabled (see also https://github.com/apache/spark/pull/32486/checks?check_run_id=2648696751) the GitHub actions workflow runs in the forked repositories, and the detection logic in the main repo fails because the runs don't exist anymore.

See also https://github.com/apache/spark/runs/2709537998?check_suite_focus=true

### Why are the changes needed?

To recover the status update of GitHub Actions in PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It cannot be tested without being merged.

Closes #32711 from HyukjinKwon/SPARK-35575.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-31 19:29:54 +09:00
Dongjoon Hyun c225196be0 [SPARK-35507][INFRA] Add Python 3.9 in the docker image for GitHub Action
### What changes were proposed in this pull request?

This PR aims to add `Python 3.9.5` and updates the docker image references except SparkR job.

### Why are the changes needed?

To save GitHub Action resource and be more robust on the the Python and R library changes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32706 from dongjoon-hyun/SPARK-35507.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-31 05:56:47 +00:00
Kousuke Saruta 2de19e460b [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
Once #32631 was merged but there was a lack of consideration.

Diff between this change and 692d95d145 merged in #32631 is as follows.

```
       if: github.repository != 'apache/spark'
       id: sync-branch
       run: |
+        apache_spark_ref=`git rev-parse HEAD`
         git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' merge --no-commit --progress --squash FETCH_HEAD
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' commit -m "Merged commit"
+        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
     - name: Cache Scala, SBT and Maven
       uses: actions/cachev2
       with:
```

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32691 from sarutak/docker-integration-test-ga-take2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 16:54:47 +09:00
Hyukjin Kwon d189cf75f9 Revert "[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA"
This reverts commit 0a74ad66b3.
2021-05-28 14:29:12 +09:00
Kousuke Saruta 0a74ad66b3 [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
`doker-integration-tests` can't run if docker is not installed so it run only if `docker-integration-tests` is specified with `--module`.

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32631 from sarutak/docker-integration-test-ga.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 07:56:37 +09:00
Hyukjin Kwon e47e615c0e [SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions
### What changes were proposed in this pull request?

This PR enables GitHub Actions to test PySpark with Python 3.9.

### Why are the changes needed?

To verify the support of Python 3.9.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Existing tests should cover.

Closes #32657 from HyukjinKwon/SPARK-35506.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 09:25:51 +09:00
Hyukjin Kwon 4a6d844184 [SPARK-35497][PYTHON] Enable plotly tests in pandas-on-Spark
### What changes were proposed in this pull request?

This PR enables plot tests with plotly

```bash
./python/run-tests --python-executables=python3 --modules=pyspark-pandas
```

**Before**:

```
Traceback (most recent call last):
  File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module>
    plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly "
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

```

**After**:

```
...
Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly
...
Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s)
...
Tests passed in 1296 seconds
```

### Why are the changes needed?

For test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

By running the tests.

Closes #32649 from HyukjinKwon/SPARK-35497.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-25 12:31:32 +09:00
Takuya UESHIN c06480519e [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows
### What changes were proposed in this pull request?

Follows checkout-merge way to use the latest commit for linter, or other workflows.

### Why are the changes needed?

For linter or other workflows besides build-and-tests, we should follow checkout-merge way to use the latest commit; otherwise, those could work on the old settings.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Existing tests.

Closes #32597 from ueshin/issues/SPARK-35450/infra.

Lead-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 10:07:28 +09:00
Takeshi Yamamuro 2390b9dbcb [SPARK-35413][INFRA] Use the SHA of the latest commit when checking out databricks/tpcds-kit
### What changes were proposed in this pull request?

This PR proposes to use the SHA of the latest commit ([2a5078a782192ddb6efbcead8de9973d6ab4f069](2a5078a782)) when checking out `databricks/tpcds-kit`. This can prevent the test workflow from breaking accidentally if the repository changes drastically.

### Why are the changes needed?

For better test workflow.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GA passed.

Closes #32561 from maropu/UseRefInCheckout.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-17 09:26:04 +09:00
Hyukjin Kwon 7d371d27f2 [SPARK-35393][PYTHON][INFRA][TESTS] Recover pip packaging test in Github Actions
### What changes were proposed in this pull request?

Currently pip packaging test is being skipped:

```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Missing virtualenv & conda, skipping pip installability tests
Cleaning up temporary directory - /tmp/tmp.iILYWISPXW
```

See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true

GitHub Actions's image has its default Conda installed at `/usr/share/miniconda` but seems like the image we're using for PySpark does not have it (which is legitimate).

This PR proposes to install Conda to use in pip packaging tests in GitHub Actions.

### Why are the changes needed?

To recover the test coverage.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It was tested in my fork: https://github.com/HyukjinKwon/spark/runs/2575126882?check_suite_focus=true

```
========================================================================
Running PySpark packaging tests
========================================================================
Constructing virtual env for testing
Using conda virtual environments
Testing pip installation with python 3.6
Using /tmp/tmp.qPjTenqfGn for virtualenv
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /tmp/tmp.qPjTenqfGn/3.6

  added / updated specs:
    - numpy
    - pandas
    - pip
    - python=3.6
    - setuptools

...

Successfully ran pip sanity check
```

Closes #32537 from HyukjinKwon/SPARK-35393.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-13 10:35:56 -07:00
Takuya UESHIN c0b52da89e [SPARK-35388][INFRA] Allow the PR source branch to include slashes
### What changes were proposed in this pull request?

This PR allows the PR source branch to include slashes.

### Why are the changes needed?

There are PRs whose source branches include slashes, like `issues/SPARK-35119/gha` here or #32523.

Before the fix, the PR build fails in `Sync the current branch with the latest in Apache Spark` phase.
For example, at #32523, the source branch is `issues/SPARK-35382/nested_higher_order_functions`:

```
...
fatal: couldn't find remote ref nested_higher_order_functions
Error: Process completed with exit code 128.
```

(https://github.com/ueshin/apache-spark/runs/2569356241)

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

This PR source branch includes slashes and #32525 doesn't.

Closes #32524 from ueshin/issues/SPARK-35119/gha.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-13 10:59:30 +09:00
Gengliang Wang dac6f175a6 [SPARK-35387][INFRA] Increase the JVM stack size for Java 11 build test
### What changes were proposed in this pull request?

After merging https://github.com/apache/spark/pull/32439, there is flaky error from the Github action job "Java 11 build with Maven":

```
Error:  ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
```
We can resolve it by increasing the stack size of JVM to 256M. The container for Github action jobs has 7G memory so this should be fine.

### Why are the changes needed?

Fix flaky test failure in Java 11 build test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Github action test

Closes #32521 from gengliangwang/increaseStackSize.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-12 10:49:09 -07:00
Kousuke Saruta 7e3446a204 [SPARK-35377][INFRA] Add JS linter to GA
### What changes were proposed in this pull request?

SPARK-35175 (#32274) added a linter for JS so let's add it to GA.

### Why are the changes needed?

To JS code keep clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA

Closes #32512 from sarutak/ga-lintjs.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 16:00:55 +09:00
Kousuke Saruta af0d99cce6 [SPARK-35375][INFRA] Use Jinja2 < 3.0.0 for Python linter dependency in GA
### What changes were proposed in this pull request?

From a few hours ago, Python linter fails in GA.
The latest Jinja 3.0.0 seems to cause this failure.
https://pypi.org/project/Jinja2/

```
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
    {% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2

re-running make html to print full warning list:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
    {% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
```

### Why are the changes needed?

To recover GA build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #32509 from sarutak/fix-python-lint-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 10:13:38 +09:00
Takeshi Yamamuro e834ef74dc [SPARK-35293][SQL][TESTS][FOLLOWUP] Update the hash key to refresh TPC-DS cache data in forked GA jobs
### What changes were proposed in this pull request?

This is a follow-up PRi of #32420 and it intends to update the hash key to refresh TPC-DS cache data in forked GA jobs.

### Why are the changes needed?

To recover GA jobs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32460 from maropu/SPARK-35293-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 16:06:50 -07:00
Takeshi Yamamuro 5c67d0c8f7 [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite
### What changes were proposed in this pull request?

This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.

### Why are the changes needed?

For better testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32420 from maropu/UseTpcdsKit.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-06 15:25:46 +09:00
HyukjinKwon a2927cb28b [SPARK-35302][INFRA] Benchmark workflow should create new files for new benchmarks
### What changes were proposed in this pull request?

Currently, it fails at `git diff --name-only` when new benchmarks are added, see https://github.com/HyukjinKwon/spark/actions/runs/808870999

We should include untracked files (new benchmark result files) to upload so developers download the results.

### Why are the changes needed?

So the new benchmark results can be added and uploaded.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Tested at:

https://github.com/HyukjinKwon/spark/actions/runs/808867285

Closes #32428 from HyukjinKwon/include-new-benchmarks.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 19:02:52 +09:00
Takeshi Yamamuro cd689c942c [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
juliuszsompolski
npoggi
wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
 - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
 - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

 I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema #32037. So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-03 12:04:42 +09:00
Karen Feng 355c39939d [SPARK-35140][INFRA] Add error message guidelines to PR template
### What changes were proposed in this pull request?

Adds a link to the [error message guidelines](https://spark.apache.org/error-message-guidelines.html) to the PR template to increase visibility.

### Why are the changes needed?

Increases visibility of the error message guidelines, which are otherwise hidden in the [Contributing guidelines](https://spark.apache.org/contributing.html).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #32241 from karenfeng/spark-35140.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 21:34:49 +09:00
HyukjinKwon 97ec57e667 [SPARK-35120][INFRA][FOLLOW-UP] Try catch an error to show the correct guidance
### What changes were proposed in this pull request?

This PR proposes to handle 404 not found, see https://github.com/apache/spark/pull/32255/checks?check_run_id=2390446579 as an example.

If a fork does not have any previous workflow runs, it seems throwing 404 error instead of empty runs.

### Why are the changes needed?

To show the correct guidance to contributors.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/48. See https://github.com/HyukjinKwon/spark/runs/2391469416 as an example.

Closes #32258 from HyukjinKwon/SPARK-35120-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-21 13:32:04 +08:00
HyukjinKwon dc7d41eee9 [SPARK-35120][INFRA] Guide users to sync branch and enable GitHub Actions in their forked repository
### What changes were proposed in this pull request?

This PR proposes to add messages when the workflow fails to find the workflow run in a forked repository, for example as below:

**Before**

![Screen Shot 2021-04-19 at 9 41 52 PM](https://user-images.githubusercontent.com/6477701/115238011-28e19b00-a158-11eb-8c5c-6374ca1e9790.png)

![Screen Shot 2021-04-19 at 9 42 00 PM](https://user-images.githubusercontent.com/6477701/115237984-22ebba00-a158-11eb-9b0f-11fe11072830.png)

**After**

![Screen Shot 2021-04-19 at 9 25 32 PM](https://user-images.githubusercontent.com/6477701/115237507-9c36dd00-a157-11eb-8ba7-f5f88caa1058.png)

![Screen Shot 2021-04-19 at 9 23 13 PM](https://user-images.githubusercontent.com/6477701/115236793-c2a84880-a156-11eb-98fc-1bb7d4bc31dd.png)
(typo `foce` in the image was fixed)

See this example: https://github.com/HyukjinKwon/spark/runs/2380644793

### Why are the changes needed?

To guide users to enable Github Actions in their forked repositories (and sync their branch to the latest `master` in Apache Spark).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in:
- https://github.com/HyukjinKwon/spark/pull/47
- https://github.com/HyukjinKwon/spark/pull/46

Closes #32235 from HyukjinKwon/test-test-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 10:40:50 -07:00
HyukjinKwon 2bdb26b374 [SPARK-35101][INFRA] Add GitHub status check in PR instead of a comment
### What changes were proposed in this pull request?

TL;DR: now it shows green yellow read status of tests instead of relying on a comment in a PR, **see https://github.com/HyukjinKwon/spark/pull/41 for an example**.

This PR proposes the GitHub status checks instead of a comment that link to the build (from forked repository) in PRs.

This is how it works:

1. **forked repo**: "Build and test" workflow is triggered when you create a branch to create a PR which uses your resources in GitHub Actions.
1. **main repo**: "Notify test workflow" (previously created a comment) now creates a in-progress status (yellow status) as a GitHub Actions check to your current PR.
1.  **main repo**: "Update build status workflow" regularly (every 15 mins) checks open PRs, and updates the status of GitHub Actions checks at PRs according to the status of workflows in the forked repositories (status sync).

**NOTE** that creating/updating statuses in the PRs is only allowed from the main repo. That's why the flow is as above.

### Why are the changes needed?

The GitHub status shows a green although the tests are running, which is confusing.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at:
- https://github.com/HyukjinKwon/spark/pull/41
- HyukjinKwon#42
- HyukjinKwon#43
- https://github.com/HyukjinKwon/spark/pull/37

**queued**:
<img width="861" alt="Screen Shot 2021-04-16 at 10 56 03 AM" src="https://user-images.githubusercontent.com/6477701/114960831-c9a73080-9ea2-11eb-8442-ddf3f6008a45.png">

**in progress**:
<img width="871" alt="Screen Shot 2021-04-16 at 12 14 39 PM" src="https://user-images.githubusercontent.com/6477701/114966359-59ea7300-9ead-11eb-98cb-1e63323980ad.png">

**passed**:
![Screen Shot 2021-04-16 at 2 04 07 PM](https://user-images.githubusercontent.com/6477701/114974045-a12c3000-9ebc-11eb-9be5-653393a863e6.png)

**failure**:
![Screen Shot 2021-04-16 at 10 46 10 PM](https://user-images.githubusercontent.com/6477701/115033584-90ec7300-9f05-11eb-8f2e-0fc2ef986a70.png)

Closes #32193 from HyukjinKwon/update-checks-pr-poc.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-18 11:33:42 +09:00
Yikun Jiang 963f434ed5 [SPARK-35082][INFRA] Use permissive and squshed merge when syncing to the latest branch in GitHub Actions testing
### What changes were proposed in this pull request?

There is an issue when syncing to the Apache master branch, see also https://github.com/apache/spark/pull/32168:

```
From https://github.com/yaooqinn/spark
 * branch                  SPARK-35044 -> FETCH_HEAD
fatal: Not possible to fast-forward, aborting.
Error: Process completed with exit code 128.
```

This is because we use `--ff-only` option so it assumes that the fork is always based on the latest master branch.
We should make it less strict.

This PR proposes to use the same command when we merge PRs:

c8f56eb7bb/dev/merge_spark_pr.py (L127)

### Why are the changes needed?

To unblock PR testing broken.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Locally tested

Co-authored-by: Kent Yao <yaooqinnhotmail.com>

Closes #32168

Closes #32182 from Yikun/SPARK-rm-fast-forward.

Lead-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-15 12:37:17 +09:00
HyukjinKwon 2974b70d1e [SPARK-35048][INFRA] Distribute GitHub Actions workflows to fork repositories to share the resources
### What changes were proposed in this pull request?

This PR proposes to leverage the GitHub Actions resources from the forked repositories instead of using the resources in ASF organisation at GitHub.

This is how it works:

1. "Build and test" (`build_and_test.yml`)  triggers a build on any commit on any branch (except `branch-*.*`), which roughly means:
    - The original repository will trigger the build on any commits in `master` branch
    - The forked repository will trigger the build on any commit in any branch.
2. The build triggered in the forked repository will checkout the original repository's `master` branch locally, and merge the branch from the forked repository into the original repository's `master` branch locally.
  Therefore, the tests in the forked repository will run after being sync'ed with the original repository's `master` branch.
3. In the original repository, it triggers a workflow that detects the workflow triggered in the forked repository, and add a comment, to the PR, pointing out the workflow in forked repository.

In short, please see this example HyukjinKwon#34

1. You create a PR and your repository triggers the workflow. Your PR uses the resources allocated to you for testing.
2. Apache Spark repository finds your workflow, and links it in a comment in your PR

**NOTE** that we will still run the tests in the original repository for each commit pushed to `master` branch. This distributes the workflows only in PRs.

### Why are the changes needed?

ASF shares the resources across all the ASF projects, which makes the development slow down.
Please see also:
- Discussion in the buildsa.o mailing list: https://lists.apache.org/x/thread.html/r48d079eeff292254db22705c8ef8618f87ff7adc68d56c4e5d0b4105%3Cbuilds.apache.org%3E
- Infra ticket: https://issues.apache.org/jira/browse/INFRA-21646

By distributing the workflows to use author's resources, we can get around this issue.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/34 and https://github.com/HyukjinKwon/spark/pull/33.

Closes #32092 from HyukjinKwon/poc-fork-resources.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-14 13:28:11 +09:00
Kent Yao 16e2faadac [SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type
### What changes were proposed in this pull request?

According to  http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf

```
 2.2.2 Datatype
2.2.2.1 Each column employs one of the following datatypes:
a) Identifier means that the column shall be able to hold any key value generated for that column.
b) Integer means that the column shall be able to exactly represent integer values (i.e., values in increments of
1) in the range of at least ( − 2n − 1) to (2n − 1 − 1), where n is 64.
c) Decimal(d, f) means that the column shall be able to represent decimal values up to and including d digits,
of which f shall occur to the right of the decimal place; the values can be either represented exactly or
interpreted to be in this range.
d) Char(N) means that the column shall be able to hold any string of characters of a fixed length of N.
Comment: If the string that a column of datatype char(N) holds is shorter than N characters, then trailing
spaces shall be stored in the database or the database shall automatically pad with spaces upon retrieval such
that a CHAR_LENGTH() function will return N.
e) Varchar(N) means that the column shall be able to hold any string of characters of a variable length with a
maximum length of N. Columns defined as "varchar(N)" may optionally be implemented as "char(N)".
f) Date means that the column shall be able to express any calendar day between January 1, 1900 and
December 31, 2199.
2.2.2.2 The datatypes do not correspond to any specific SQL-standard datatype. The definitions are provided to
highlight the properties that are required for a particular column. The benchmark implementer may employ any internal representation or SQL datatype that meets those requirements.
```

This PR proposes that we use int for identifiers instead of bigint to reach a compromise with TPC-DS Standard Specification.

After this PR, the field schemas are now consistent with those DDLs in the `tpcds.sql` from tpc-ds tool kit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

### Why are the changes needed?

reach a compromise with TPC-DS Standard Specification

### Does this PR introduce _any_ user-facing change?

no test only

### How was this patch tested?

test only

Closes #32037 from yaooqinn/SPARK-34944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-13 11:28:35 +08:00
HyukjinKwon a3d1e00317 [SPARK-35002][INFRA][FOLLOW-UP] Use localhost instead of 127.0.0.1 at SPARK_LOCAL_IP in GA builds
### What changes were proposed in this pull request?

This PR replaces 127.0.0.1 to `localhost`.

### Why are the changes needed?

- https://github.com/apache/spark/pull/32096#discussion_r610349269
- https://github.com/apache/spark/pull/32096#issuecomment-816442481

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I didn't test it because it's CI specific issue. I will test it in Github Actions build in this PR.

Closes #32102 from HyukjinKwon/SPARK-35002.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-04-09 16:39:20 +08:00
Yuming Wang 9663c4061a [SPARK-35002][INFRA] Fix the java.net.BindException when testing with Github Action
### What changes were proposed in this pull request?

This PR tries to fix the `java.net.BindException` when testing with Github Action:
```
[info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite *** ABORTED *** (282 milliseconds)
[info]   java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
[info]   at sun.nio.ch.Net.bind0(Native Method)
[info]   at sun.nio.ch.Net.bind(Net.java:461)
[info]   at sun.nio.ch.Net.bind(Net.java:453)
```

https://github.com/apache/spark/pull/32090/checks?check_run_id=2295418529

### Why are the changes needed?

Fix test framework.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test by Github Action.

Closes #32096 from wangyum/SPARK_LOCAL_IP=localhost.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-08 23:03:51 -07:00
Takuya UESHIN 2635c3894f [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work
### What changes were proposed in this pull request?

Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure.

### Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled the whole doctests.

Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-07 20:50:41 +09:00
HyukjinKwon 82ad2f9dff [SPARK-34951][INFRA][PYTHON][TESTS] Set the system encoding as UTF-8 to recover the Sphinx build in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to set the system encoding as UTF-8. For some reasons, it looks like GitHub Actions machines changed theirs to ASCII by default. This leads to default encoding/decoding to use ASCII in Python, e.g.) `"a".encode()`, and looks like Sphinx depends on that.

### Why are the changes needed?

To recover GItHub Actions build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Tested in https://github.com/apache/spark/pull/32046

Closes #32047 from HyukjinKwon/SPARK-34951.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-04 16:22:57 +09:00
HyukjinKwon 71effba5f2 [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
### What changes were proposed in this pull request?

This PR proposes to add a workflow that allows developers to run benchmarks and download the results files.  After this PR, developers can run benchmarks in GitHub Actions in their fork.

### Why are the changes needed?

1. Very easy to use.
2. We can use the (almost) same environment to run the benchmarks. Given my few experiments and observation, the CPU, cores, and memory are same.
3. Does not burden ASF's resource at GitHub Actions.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in https://github.com/HyukjinKwon/spark/pull/31.

Entire benchmarks are being run as below:
- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

### How do developers use it in their fork?

1. **Go to Actions in your fork, and click "Run benchmarks"**

    ![Screen Shot 2021-03-31 at 10 15 13 PM](https://user-images.githubusercontent.com/6477701/113150018-99d71680-926e-11eb-8647-4ecf062c55f2.png)

2. **Run the benchmarks with JDK 8 or 11 with benchmark classes to run. Glob pattern is supported just like `testOnly` in SBT**

    ![Screen Shot 2021-04-02 at 8 35 02 PM](https://user-images.githubusercontent.com/6477701/113412599-ab95f680-93f3-11eb-9a15-c6ed54587b9d.png)

3. **After finishing the jobs, the benchmark results are available on the top in the underlying workflow:**

    ![Screen Shot 2021-03-31 at 10 17 21 PM](https://user-images.githubusercontent.com/6477701/113150332-ede1fb00-926e-11eb-9c0e-97d195070508.png)

4. **After downloading it, unzip and untar at Spark git root directory:**

    ```bash
    cd .../spark
    mv ~/Downloads/benchmark-results-8.zip .
    unzip benchmark-results-8.zip
    tar -xvf benchmark-results-8.tar
    ```

5. **Check the results:**

    ```bash
    git status
    ```

    ```
    ...
        modified:   core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
    ```

Closes #32015 from HyukjinKwon/SPARK-34821-pr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-03 20:55:54 +09:00
Kousuke Saruta f99a831dab [SPARK-24931][INFRA] Fix the GA failure related to R linter
### What changes were proposed in this pull request?

This PR fixes the GA failure related to R linter which happens on some PRs (e.g. #32023, #32025).
The reason seems `Rscript -e "devtools::install_github('jimhester/lintrv2.0.0')"` fails to download `lintrv2.0.0`.
I don't know why but I confirmed we can download `v2.0.1`.

### Why are the changes needed?

To keep GA healthy.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #32028 from sarutak/hotfix-r.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-01 12:04:19 -07:00
HyukjinKwon 48ef9bd2b3 [SPARK-34915][INFRA] Cache Maven, SBT and Scala in all jobs that use them
### What changes were proposed in this pull request?

This PR proposes to cache Maven, SBT and Scala in all jobs that use them. For simplicity, we use the same key `build-` and just cache all SBT, Maven and Scala. The cache is not very large.

### Why are the changes needed?

To speed up the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It will be tested in this PR's GA jobs.

Closes #32011 from HyukjinKwon/SPARK-34915.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-03-31 18:13:50 +08:00
Takeshi Yamamuro 46f96e9ce1 [SPARK-34795][SQL][TESTS] Adds a new job in GitHub Actions to check the output of TPC-DS queries
### What changes were proposed in this pull request?

This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries.

NOTE: I've checked that the new job took 17m 35s in the GitHub Actions env.

### Why are the changes needed?

There are some cases where we noticed runtime-realted bugs after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The new test added.

Closes #31886 from maropu/TPCDSQueryTestSuite.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-31 09:48:19 +09:00
HyukjinKwon c8233f1be5 [SPARK-34874][INFRA] Recover test reports for failed GA builds
### What changes were proposed in this pull request?

621becc6d7 (r48726074)
there was a behaviour change in the download artifact plugin and it disabled the test reporting in failed builds.

This PR recovers it by explicitly setting the conclusion from the workflow runs to search for the artifacts to download.

### Why are the changes needed?

In order to properly report the failed test cases.

### Does this PR introduce _any_ user-facing change?

No, it's dev only.

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/30

Before:

![Screen Shot 2021-03-26 at 10 54 48 AM](https://user-images.githubusercontent.com/6477701/112566110-b7951d80-8e21-11eb-8fad-f637db9314d5.png)

After:

![Screen Shot 2021-03-26 at 5 04 01 PM](https://user-images.githubusercontent.com/6477701/112606215-7588cd80-8e5b-11eb-8fdd-3afebd629f4f.png)

Closes #31970 from HyukjinKwon/SPARK-34874.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-26 18:12:28 +09:00
Dongjoon Hyun 4818847e87 [SPARK-34578][SQL][TESTS][TEST-MAVEN] Refactor ORC encryption tests and ignore ORC shim loaded by old Hadoop library
### What changes were proposed in this pull request?

1. This PR aims to ignore ORC encryption tests when ORC shim is loaded by old Hadoop library by some other tests. The test coverage is preserved by Jenkins SBT runs and GitHub Action jobs. This PR only aims to recover Maven Jenkins jobs.
2. In addition, this PR simplifies SBT testing by refactor the test config to `SparkBuild.scala/pom.xml` and remove `DedicatedJVMTest`. This will remove one GitHub Action job which was recently added for `DedicatedJVMTest` tag.

### Why are the changes needed?

Currently, Maven test fails when it runs in a batch mode because `HadoopShimsPre2_3$NullKeyProvider` is loaded.

**MVN COMMAND**
```
$ mvn test -pl sql/core --am -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite,org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite
```

**BEFORE**
```
- Write and read an encrypted table *** FAILED ***
...
  Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii
	at org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider.getCurrentKeyVersion(HadoopShimsPre2_3.java:71)
	at org.apache.orc.impl.WriterImpl.getKey(WriterImpl.java:871)
```

**AFTER**
```
OrcV1QuerySuite
...
OrcEncryptionSuite:
- Write and read an encrypted file !!! CANCELED !!!
  [] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider1b705f65 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:39)
- Write and read an encrypted table !!! CANCELED !!!
  [] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider22adeee1 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:67)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins Maven tests.

For SBT command,
- the test suite required a dedicated JVM (Before)
- the test suite doesn't require a dedicated JVM (After)
```
$ build/sbt "sql/testOnly *.OrcV1QuerySuite *.OrcEncryptionSuite"
...
[info] OrcV1QuerySuite
...
[info] - SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core (26 milliseconds)
[info] OrcEncryptionSuite:
[info] - Write and read an encrypted file (431 milliseconds)
[info] - Write and read an encrypted table (359 milliseconds)
[info] All tests passed.
[info] Passed: Total 35, Failed 0, Errors 0, Passed 35
```

Closes #31697 from dongjoon-hyun/SPARK-34578-TEST.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:52:27 +09:00
Yikun Jiang 85b50d4258 [SPARK-34539][BUILD][INFRA] Remove stand-alone version Zinc server
### What changes were proposed in this pull request?
Cleanup all Zinc standalone server code, and realated coniguration.

### Why are the changes needed?
![image](https://user-images.githubusercontent.com/1736354/109154790-c1d3e580-77a9-11eb-8cde-835deed6e10e.png)
- Zinc is the incremental compiler to speed up builds of compilation.
- The scala-maven-plugin is the mave plugin, which is used by Spark, one of the function is to integrate the Zinc to enable the incremental compiler.
- Since Spark v3.0.0 ([SPARK-28759](https://issues.apache.org/jira/browse/SPARK-28759)), the scala-maven-plugin is upgraded to v4.X, that means Zinc v0.3.13 standalone server is useless anymore.

However, we still download, install, start the standalone Zinc server. we should remove all zinc standalone server code, and all related configuration.

See more in [SPARK-34539](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-34539) or the doc [Zinc standalone server is useless after scala-maven-plugin 4.x](https://docs.google.com/document/d/1u4kCHDx7KjVlHGerfmbcKSB0cZo6AD4cBdHSse-SBsM).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run any mvn build:
./build/mvn -DskipTests clean package -pl core
You could see the increamental compilation is still working, the stage of "scala-maven-plugin:4.3.0:compile (scala-compile-first)" with incremental compilation info, like:
```
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first)  spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /root/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] Compiling 303 Scala sources and 27 Java sources to /root/spark/core/target/scala-2.12/test-classes ...
```

Closes #31647 from Yikun/cleanup-zinc.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-01 08:39:38 -06:00
Dongjoon Hyun 2fb5f21b1e [SPARK-34495][TESTS] Add DedicatedJVMTest test tag
### What changes were proposed in this pull request?

This PR aims to add a test tag, `DedicatedJVMTest`, and replace `SecurityTest` with this.

### Why are the changes needed?

To have a reusable general test tag.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31607 from dongjoon-hyun/SPARK-34495.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-22 16:00:48 +09:00
Dongjoon Hyun 03f4cf5845 [SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider
### What changes were proposed in this pull request?

This is a retry of #31065 . Last time, the newly add test cases passed in Jenkins and individually, but it's reverted because they fail when `GitHub Action` runs with  `SERIAL_SBT_TESTS=1`.

In this PR, `SecurityTest` tag is used to isolate `KeyProvider`.

This PR aims to add a basis for columnar encryption test framework by add `OrcEncryptionSuite` and `FakeKeyProvider`.

Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe.

### Why are the changes needed?

Apache ORC 1.6 supports columnar encryption.

### Does this PR introduce _any_ user-facing change?

No. This is for a test case.

### How was this patch tested?

Pass the newly added test suite.

Closes #31603 from dongjoon-hyun/SPARK-34486-RETRY.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 15:05:29 -08:00
“attilapiros” bdcad33d8b [SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler
### What changes were proposed in this pull request?

Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.

Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".

After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files

This process for version update is tested for details please check the testing section.

### Why are the changes needed?

Using different Jekyll versions can generate different output documents.
This PR standardize the process.

### Does this PR introduce _any_ user-facing change?

No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case  there should be no difference at all as the same Jekyll version is specified in the Gemfile.

### How was this patch tested?

#### Testing document generation

Doc generation step was triggered via  the docker release:

```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```

The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```

#### Testing Jekyll (or other gem) update

First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle

# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```

Testing Jekyll version before the update:

```
$ bundle exec jekyll --version
jekyll 4.1.0
```

Imitating Jekyll update coming from git by reverting my local changes:

```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"

$ git checkout Gemfile.lock
Updated 1 path from the index
```

Run the install:

```
$ bundle install
...
```

Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```

Closes #31559 from attilapiros/pin-jekyll-version.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 12:17:57 +09:00
Jarek Potiuk a4b70758d3 [SPARK-34053][INFRA][FOLLOW-UP] Disables canceling push/schedule workflows
Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. This has the effect that duplicates of all direct pushes (master merges or direct pushes to the spark repository are not cancelled)

### What changes were proposed in this pull request?

Update to CI cancel policy to skip direct pushes. Duplicates will only be cancelled for Pull Requests.

### Why are the changes needed?

[Apparenlty](https://github.com/apache/spark/pull/31104#issuecomment-758318463) the aggressive behavior of the cancel action which also cancels duplicate master builds is too agressive for spark community. This change spares merges to master and scheduled builds from duplicate checking (as a result all merges to master will be always build to completion).

The previous behavior of the action was that in case of subsequent merges to master, only the latest one was guaranteed to complete. Other changes could be cancelled before they complete to save job queue.

### Does this PR introduce _any_ user-facing change?

No, except if the master builds were somehow facing the users (but it's unlikely taking into account the ASF release policy).
There was a potential that some changes that could be detected by specific master merge failing could be detected later (in one of the subsequent builds) which could make investigation of the root cause of failure a bit more difficult, because it could have been introduced in one of the commits between two completed builds merges. But this is at most impacting the timeline of release close to release itself, not the release itself.

### How was this patch tested?

This configuration parameter has been tested earlier in Airflow. We used it initially, but since our master builds are heavy and we have extensive tests in the PRs and investigation for failed builds is not as difficult we found that limiting the strain on Github Action and cancelling master builds was more important for the health of the whole ASF community and we removed that configuration.

Tested in https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46 where the action found other master builds in progress but did not add them as candidates to cancel.

Closes #31153 from potiuk/skip-schedule-push-branches.

Authored-by: Jarek Potiuk <potiuk@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-12 23:10:20 +09:00