Commit graph

158 commits

Author SHA1 Message Date
Kousuke Saruta 513b6f5af2 [SPARK-33079][TESTS] Replace the existing Maven job for Scala 2.13 in Github Actions with SBT job
### What changes were proposed in this pull request?

SPARK-32926 added a build test to GitHub Action for Scala 2.13 but it's only with Maven.
As SPARK-32873 reported, some compilation error happens only with SBT so I think we need to add another build test to GitHub Action for SBT.
Unfortunately, we don't have abundant resources for GitHub Actions so instead of just adding the new SBT job, let's replace the existing Maven job with the new SBT job for Scala 2.13.

### Why are the changes needed?

To ensure build test passes even with SBT for Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GitHub Actions' job.

Closes #29958 from sarutak/add-sbt-job-for-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-15 20:51:20 +09:00
Dongjoon Hyun e85ed8a14c [SPARK-33156][INFRA] Upgrade GithubAction image from 18.04 to 20.04
### What changes were proposed in this pull request?

This PR aims to upgrade `Github Action` runner image from `Ubuntu 18.04 (LTS)` to `Ubuntu 20.04 (LTS)`.

### Why are the changes needed?

`ubuntu-latest` in `GitHub Action` is still `Ubuntu 18.04 (LTS)`.
- https://github.com/actions/virtual-environments#available-environments

This upgrade will help Apache Spark 3.1+ preparation for vote and release on the latest OS.

This is tested here.
- https://github.com/dongjoon-hyun/spark/pull/36

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the `Github Action` in this PR.

Closes #30050 from dongjoon-hyun/ubuntu_20.04.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-15 02:24:49 -07:00
HyukjinKwon a0aa8f33a9 [SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found
### What changes were proposed in this pull request?

This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found.

Currently, we're running and skipping the tests dynamically. For example,
- if there are only changes in SparkR at the underlying commit, it only runs the SparkR tests, skips the other tests, and generates JUnit XML files for the SparkR test cases.
- if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.

When the test reporting ("Report test results") job is triggered after the main build ("Build and test") is finished and no JUnit XML files are found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example.

This PR works around it by simply skipping the test report when no JUnit XML files are found.
Please see https://github.com/apache/spark/pull/29906#issuecomment-702525542 for more details.
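
The check described above can be sketched in Python as follows. This is only an illustration, not the actual workflow configuration: the `target/test-reports/*.xml` layout and the function names `find_junit_reports` and `should_report` are assumptions.

```python
import glob
import os


def find_junit_reports(root):
    """Return JUnit XML report paths below `root`.

    The `target/test-reports/*.xml` layout is an assumption for illustration.
    """
    pattern = os.path.join(root, "**", "target", "test-reports", "*.xml")
    return sorted(glob.glob(pattern, recursive=True))


def should_report(root):
    """Post a test report only when at least one JUnit XML file exists."""
    return bool(find_junit_reports(root))
```

With a check like this, a docs-only change (which produces no JUnit XML files) would skip the report step instead of being flagged as a failure.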

### Why are the changes needed?

To avoid false alarms in test results.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my fork.

Positive case:

https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/288996327

Negative case:

https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/289000058

Closes #29946 from HyukjinKwon/test-junit-files.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 09:09:58 +09:00
HyukjinKwon b205be5ff6 [SPARK-33051][INFRA][R] Uses setup-r to install R in GitHub Actions build
### What changes were proposed in this pull request?

In SPARK-32493, the R installation was switched to manual installation because setup-r was broken. This seems to be fixed upstream, so we should switch it back.

### Why are the changes needed?

To avoid maintaining the installation steps ourselves.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR should test it.

Closes #29931 from HyukjinKwon/recover-r-build.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-02 15:12:33 +09:00
Dongjoon Hyun a8442c2826 [SPARK-32926][TESTS] Add Scala 2.13 build test in GitHub Action
### What changes were proposed in this pull request?

The PR aims to add Scala 2.13 build test coverage into GitHub Action for Apache Spark 3.1.0.

### Why are the changes needed?

The branch is ready for Scala 2.13 and this will prevent any regression.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Pass the GitHub Action.

Closes #29793 from dongjoon-hyun/SPARK-32926.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 14:01:52 -07:00
HyukjinKwon b07e7429a6 [SPARK-32695][INFRA] Explicitly cache and hash 'build' directly in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc in GitHub Actions. Otherwise, it can end up overwriting the `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436

Previously, other files like `build/mvn` and `build/sbt` were also cached and overwritten. So, when you had changes there, they were ignored.
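
To illustrate the idea of caching only explicitly named entries and keying the cache on file contents, here is a rough Python sketch. It is not the workflow definition itself: the prefixes in `CACHED_PREFIXES` and the function names are assumptions made for the example.

```python
import hashlib
import os

# Hypothetical name prefixes of downloaded tools worth caching; the launcher
# scripts (`build/mvn`, `build/sbt`) are deliberately not listed.
CACHED_PREFIXES = ("apache-maven-", "zinc-", "scala-")


def entries_to_cache(build_dir, prefixes=CACHED_PREFIXES):
    """List only the explicitly named entries under `build_dir`."""
    return sorted(
        os.path.join(build_dir, name)
        for name in os.listdir(build_dir)
        if name.startswith(prefixes)
    )


def cache_key(label, files):
    """Derive a cache key from the contents of `files`."""
    digest = hashlib.sha256()
    for path in sorted(files):
        with open(path, "rb") as handle:
            digest.update(handle.read())
    return f"{label}-{digest.hexdigest()}"
```

Because `build/mvn` and `build/sbt` never match a listed prefix, restoring the cache cannot overwrite changes made to those scripts.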

### Why are the changes needed?

To make GitHub Actions build stable.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

The builds in this PR test it out.

Closes #29536 from HyukjinKwon/SPARK-32695.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:25:59 +09:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It is based on Docker images. Once somebody builds an image, other people can reuse it for the same commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply your changes as they are or adjust them slightly when I apply them to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
Takeshi Yamamuro 6dd37cbaac [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test triggers
### What changes were proposed in this pull request?

This PR proposes to add a `workflow_dispatch` entry in the GitHub Action script (`build_and_test.yml`). This update enables developers to run the Spark tests for a specific branch on their own local repository, so I think it might help to check whether all the tests pass before opening a new PR.

<img width="944" alt="Screen Shot 2020-08-21 at 16 28 41" src="https://user-images.githubusercontent.com/692303/90866249-96250c80-e3ce-11ea-8496-3dd6683e92ea.png">

### Why are the changes needed?

To reduce the pressure of GitHub Actions on the Spark repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #29504 from maropu/DispatchTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-21 21:23:41 +09:00
HyukjinKwon bfd8c34154 [SPARK-32645][INFRA] Upload unit-tests.log as an artifact
### What changes were proposed in this pull request?

This PR proposes to upload `target/unit-tests.log` as an artifact so that it can be downloaded here:
![Screen Shot 2020-08-18 at 2 23 18 PM](https://user-images.githubusercontent.com/6477701/90474095-789e3b80-e15f-11ea-87f8-e7da3df3c03e.png)

### Why are the changes needed?

Jenkins has this feature. It would be best to have the same dev functionality as Jenkins.
Also, note that this was pointed out at https://github.com/apache/spark/pull/29225#discussion_r471485011.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

https://github.com/apache/spark/actions/runs/213000777 should demonstrate it

Closes #29454 from HyukjinKwon/SPARK-32645.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 12:28:36 +09:00
HyukjinKwon d0dfe4986b [MINOR][INFRA] Rename master.yml to build_and_test.yml
### What changes were proposed in this pull request?

This PR renames `master.yml` to `build_and_test.yml` to indicate this is the workflow that builds and runs the tests.

### Why are the changes needed?

Just for readability. `master.yml` looks like the name of the branch (to me).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR will test it out.

Closes #29459 from HyukjinKwon/minor-rename.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-18 18:18:47 +08:00
HyukjinKwon 86852c57af [SPARK-32606][SPARK-32605][INFRA] Remove the forks of action-surefire-report and action-download-artifact in test_report.yml
### What changes were proposed in this pull request?

This PR proposes to remove the usage of my own forks and use the original plugins in GitHub Actions testing report.

SPARK-32357 introduced the GitHub Actions test reporting by leveraging two plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report)
 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact)

To make it work, I had to fork two repositories with custom fixes:
  - HyukjinKwon/action-surefire-report@c96094c
  - f86c565d52

The two custom fixes are thankfully merged at https://github.com/ScaCap/action-surefire-report/pull/14 and https://github.com/dawidd6/action-download-artifact/pull/24, and they released new ones to use at [ScaCap/action-surefire-report/commits/v1](https://github.com/ScaCap/action-surefire-report/commits/v1) and [dawidd6/action-download-artifact/commits/v2](https://github.com/dawidd6/action-download-artifact/commits/v2)  - thanks jmisur and dawidd6 again.

### Why are the changes needed?

To avoid relying on forks and code duplications.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Logically there is no diff. I tested it at https://github.com/HyukjinKwon/spark/runs/992824229 to be doubly sure.

NOTE that this PR cannot be tested here within the workflow triggered by this PR without merging the changes in `test_report.yml` into the master.

Closes #29449 from HyukjinKwon/SPARK-32606-SPARK-32605.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-17 11:17:50 -07:00
Hyukjin Kwon 5debde9401 [SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to report the failed and succeeded tests in GitHub Actions in order to improve the development velocity by leveraging [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report). See the example below:

![Screen Shot 2020-08-13 at 8 17 52 PM](https://user-images.githubusercontent.com/6477701/90128649-28f7f280-dda2-11ea-9211-e98e34332f6b.png)

Note that we cannot just use [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) in Apache Spark because PRs come from forked repositories, and GitHub secrets are unavailable there for security reasons. This plugin and all similar plugins require a GitHub token with write access in order to post test results, but such a token is unavailable in PRs.

To work around this limitation, I took this approach:

1. In workflow A, run the tests and upload the JUnit XML test results. GitHub supports uploading and downloading files as artifacts.
2. GitHub introduced a new event type, [`workflow_run`](https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/), 10 days ago. By leveraging this, workflow A triggers another workflow B.
3. Workflow B runs in the main repo instead of the forked repo, and has the write access the plugin needs. In workflow B, it downloads the artifact uploaded from workflow A (from the forked repository).
4. Workflow B generates the test reports to post from the JUnit XML files.
5. Workflow B looks up the PR and posts the test reports.
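
As a rough illustration of the report-generation step, the following Python sketch summarizes passed and failed test cases from one JUnit XML document. It is not how the plugin is implemented; the function name `summarize_junit` is hypothetical, and it assumes the common JUnit layout with `<testcase>` elements carrying `<failure>`, `<error>`, or `<skipped>` children.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Return (passed, failed) test names from one JUnit XML document."""
    root = ET.fromstring(xml_text)
    passed, failed = [], []
    # `iter` also handles a <testsuites> wrapper around several <testsuite>s.
    for case in root.iter("testcase"):
        name = f"{case.get('classname')}.{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(name)
        elif case.find("skipped") is None:
            passed.append(name)
    return passed, failed
```

A reporting job could run this over every downloaded XML file and post the aggregated lists to the PR, so that nobody has to scroll through the raw logs.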

The `workflow_run` event is a very new feature, and it looks like not many GitHub Actions plugins support it yet. To make this work with [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report), I had to fork two GitHub Actions plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) to have this custom fix: c96094cc35
    It added a `commit` argument to specify the commit to post the test reports to. With `workflow_run`, workflow B can access the commit from workflow A.

 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) to have this custom fix: 750b71af35
    It added support for downloading all artifacts from workflow A in workflow B. By default, it only supports specifying the name of a single artifact.

    Note that I was not able to use the official [actions/download-artifact](https://github.com/actions/download-artifact) because:
      - It does not support downloading artifacts across different workflows, see also https://github.com/actions/download-artifact/issues/3. Once this issue is resolved, we can switch back to [actions/download-artifact](https://github.com/actions/download-artifact).

I plan to make a pull request for both repositories so we don't have to rely on forks.

### Why are the changes needed?

Currently, it's difficult to check the failed tests. You have to scroll through long GitHub Actions logs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at: https://github.com/HyukjinKwon/spark/pull/17, https://github.com/HyukjinKwon/spark/pull/18, https://github.com/HyukjinKwon/spark/pull/19, https://github.com/HyukjinKwon/spark/pull/20, and master branch of my forked repository.

Closes #29333 from HyukjinKwon/SPARK-32357-fix.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-13 20:50:47 -07:00
HyukjinKwon 32f4ef005f [SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub Actions
### What changes were proposed in this pull request?

CRAN check fails due to the size of the generated PDF docs as below:

```
...
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
...
Status: 1 WARNING, 1 NOTE
See
  ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
for details.
```

This PR proposes to install `qpdf` in GitHub Actions.

Note that I cannot reproduce this locally with the same R version, so I am not documenting it for now.

Also, while I am here, I piggyback the installation of SparkR when the modules include `sparkr`. It is rather a followup of SPARK-32491.

### Why are the changes needed?

To fix SparkR CRAN check failure.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions will test it out.

Closes #29306 from HyukjinKwon/SPARK-32497.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-31 00:57:24 +09:00
HyukjinKwon e0c8bd07af [SPARK-32493][INFRA] Manually install R instead of using setup-r in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to manually install R instead of using `setup-r`, which seems broken. Currently, GitHub Actions uses its default installed R 3.4.4, which we dropped as of SPARK-32073.

While I am here, I am also upgrading the R version to 4.0. Jenkins will test the old version and GitHub Actions tests the new version. AppVeyor uses R 4.0, but it does not run the CRAN check, which is important when we make a release.

### Why are the changes needed?

To recover GitHub Actions build.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/15

Closes #29302 from HyukjinKwon/SPARK-32493.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 20:06:35 +09:00
Dongjoon Hyun 08a66f8fd0 [SPARK-32248][BUILD] Recover Java 11 build in Github Actions
### What changes were proposed in this pull request?

This PR aims to recover Java 11 build in `GitHub Action`.

### Why are the changes needed?

This test coverage was removed before. Now, it's time to recover it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #29295 from dongjoon-hyun/SPARK-32248.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-29 18:05:53 -07:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to list the APIs or classes explicitly; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage.

### Why are the changes needed?

I often hear complaints from users that the current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html - compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00
HyukjinKwon 6bdd710c4d [SPARK-32316][TESTS][INFRA] Test PySpark with Python 3.8 in Github Actions
### What changes were proposed in this pull request?

This PR aims to test PySpark with Python 3.8 in Github Actions. On the script side, it is already supported:

4ad9bfd53b/python/run-tests.py (L161)

This PR includes small related fixes together:

1. Install Python 3.8
2. Only install one Python implementation instead of installing many for SQL and Yarn test cases because they need one Python executable in their test cases that is higher than Python 2.
3. Do not install Python 2 which is not needed anymore after we dropped Python 2 at SPARK-32138
4. Remove a comment about installing PyPy3 on Jenkins - SPARK-32278. It is already installed.

### Why are the changes needed?

Currently, only PyPy3 and Python 3.6 are being tested with PySpark in Github Actions. We should test the latest version of Python as well because some optimizations can only be enabled with Python 3.8+. See also https://github.com/apache/spark/pull/29114

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Was not tested. Github Actions build in this PR will test it out.

Closes #29116 from HyukjinKwon/test-python3.8-togehter.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 20:44:09 -07:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparisons and `__future__` imports. Also, it removes the Python 2 dedicated code such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Drop support for EOL Python versions.
 2. Reduce maintenance overhead and remove a bit of legacy code and hacks for Python 2.
 3. According to my testing and investigation, PyPy2 has a critical bug that causes a flaky test, SPARK-28358.
 4. Users can use Python type hints with Pandas UDFs without thinking about the Python version.
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Hyukjin Kwon 27ef3629dd [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only the relevant tests, just like the Jenkins PR builder does. Currently, GitHub Actions always runs the full tests, which wastes resources.

In addition, this PR also fixes 3 more closely related issues while I am here.

1. The main idea here is: it reuses the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses in order to run only the related test cases.

2. While I am here, I fixed SPARK-32292 too to run the doc tests. They failed because other references were not available when the repository is cloned via `checkout@v2`. With `fetch-depth: 0`, the history is available.

3. In addition, it fixes `dev/run-tests.py` to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were turned into proper options. For example,

    ```bash
    dev/run-tests.py --modules sql,core
    ```

    which is consistent with `python/run-tests.py`, for example,

    ```bash
    python/run-tests.py --modules pyspark-core,pyspark-ml
    ```

4. Lastly, also fixed the formatting issue in module specification in the matrix:

    ```diff
    -            network_common, network_shuffle, repl, launcher
    +            network-common, network-shuffle, repl, launcher,
    ```

    The old formatting caused the build/test of these modules to run incorrectly.
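
To make the ideas in the list above concrete, here is a minimal Python sketch of selecting modules from changed files and of parsing a `--modules` option. It is not the logic in `dev/run-tests.py`: the `MODULE_PATHS` table and all function names here are hypothetical and only illustrate the approach.

```python
import argparse

# Hypothetical mapping from module name to source path prefixes.
MODULE_PATHS = {
    "core": ("core/",),
    "sql": ("sql/",),
    "sparkr": ("R/",),
}


def modules_for_changes(changed_files, module_paths=MODULE_PATHS):
    """Return the modules whose path prefixes cover any changed file."""
    return sorted(
        module
        for module, prefixes in module_paths.items()
        if any(path.startswith(prefixes) for path in changed_files)
    )


def parse_modules(value):
    """Split a comma-separated list, tolerating spaces and a trailing comma."""
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_args(argv):
    """Parse a `--modules sql,core` style option into a list of names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--modules", type=parse_modules, default=None)
    return parser.parse_args(argv)
```

For example, `parse_args(["--modules", "sql,core"]).modules` yields a list of two module names, and a change that touches only files under `R/` selects only the `sparkr` entry of the table.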

### Why are the changes needed?

By running only the related tests, we can save a lot of resources and avoid unrelated flaky tests, etc.
Also, it now runs the doctests of `dev/run-tests.py` properly, the usage is similar between `dev/run-tests.py` and `python/run-tests.py`, and it runs the `network-common`, `network-shuffle`, `launcher` and `examples` modules too.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark:

https://github.com/HyukjinKwon/spark/pull/7
https://github.com/HyukjinKwon/spark/pull/8
https://github.com/HyukjinKwon/spark/pull/9
https://github.com/HyukjinKwon/spark/pull/10
https://github.com/HyukjinKwon/spark/pull/11
https://github.com/HyukjinKwon/spark/pull/12

Closes #29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:31:39 -07:00
Dongjoon Hyun bc3d4bacb5 [SPARK-32245][INFRA][FOLLOWUP] Reenable Github Actions on commit
### What changes were proposed in this pull request?

This PR reenables GitHub Action on every commit as a next step.

### Why are the changes needed?

We carefully enabled GitHub Action on every PR, and it looks good so far.

As we saw at https://github.com/apache/spark/pull/29072, GitHub Action is already triggered on every commit of every PR. Enabling GitHub Action on `master` branch commits doesn't make a big difference. And, we need to start testing every commit as a next step.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #29076 from dongjoon-hyun/reenable_gha_commit.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 14:50:47 -07:00
HyukjinKwon b84ed4146d [SPARK-32245][INFRA] Run Spark tests in Github Actions
### What changes were proposed in this pull request?

This PR aims to run the Spark tests in Github Actions.

To briefly explain the main idea:

- Reuse `dev/run-tests.py` with SBT build
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module
- Pass the modules to test into `dev/run-tests.py` directly via `TEST_ONLY_MODULES` environment variable. For example, `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` _does not_ take the dependent modules into account but solely the specified modules to test.
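
The module selection described above could look roughly like the following Python sketch. It only illustrates the behaviour (take exactly the listed modules, with no expansion to dependent modules) and is not the code in `dev/run-tests.py`; the `DEPENDENTS` table and the function name `modules_from_env` are assumptions.

```python
# Hypothetical table: module -> modules that depend on it.
DEPENDENTS = {
    "core": ["sql", "hive", "pyspark-sql"],
    "sql": ["hive", "pyspark-sql"],
    "hive": [],
    "pyspark-sql": [],
}


def modules_from_env(environ, dependents=DEPENDENTS):
    """Return exactly the modules named in `TEST_ONLY_MODULES`."""
    raw = environ.get("TEST_ONLY_MODULES", "")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = [name for name in names if name not in dependents]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    # Deliberately no expansion through `dependents`: only the specified
    # modules are tested, so each job in the matrix stays independent.
    return names
```

Passing `os.environ` (or any mapping) as `environ` keeps the function easy to test; with `TEST_ONLY_MODULES=core`, only `core` is returned even though other modules depend on it in the table.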

Another thing to note might be the `SlowHiveTest` annotation. Running the tests in the Hive modules takes too long, so the slow tests are extracted and run as a separate job. They were selected based on the actual elapsed time in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So, Hive tests are separated into two jobs. One runs the slow test cases, and the other runs the remaining test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

### Why are the changes needed?

Since last week, the Jenkins machines have become very unstable for many reasons:
  - Apparently, the machines became extremely slow. Almost all tests can't pass.
  - One machine (worker 4) started to have the corrupt `.m2` which fails the build.
  - The documentation build fails from time to time for an unknown reason, specifically on the Jenkins machines. This is disabled for now at https://github.com/apache/spark/pull/29017.
  - Almost all PRs are basically blocked by this instability currently.

The advantages of using Github Actions:
  - To avoid depending on the few people who can access the cluster.
  - To reduce the elapsed time in the build - we could split the tests (e.g., SQL, ML, CORE), and run them in parallel so the total build time will be significantly reduced.
  - To control the environment more flexibly.
  - Other contributors can test and propose to fix Github Actions configurations so we can distribute this build management cost.

Note that:
- The current build in Jenkins takes _more than 7 hours_. With Github Actions it takes _less than 2 hours_.
- We can now easily control the environments, especially for Python.
- The tests and build look more stable than on Jenkins.

### Does this PR introduce _any_ user-facing change?

No, dev-only change.

### How was this patch tested?

Tested at https://github.com/HyukjinKwon/spark/pull/4

Closes #29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 13:09:06 -07:00
HyukjinKwon f0c79ad88a [MINOR][INFRA] Add a guide to clarify release/unreleased Spark versions of user-facing change in the Github PR template
### What changes were proposed in this pull request?

This PR proposes to add a guide to clarify the Spark version when describing "Does this PR introduce any user-facing change?".

### Why are the changes needed?

It seems confusing to write when the user facing changes happen within unreleased branches.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in Github and it renders fine as intended.

Closes #28403 from HyukjinKwon/minor-more-guide.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-30 09:22:07 +09:00
Dongjoon Hyun 2d3e9601b5 [SPARK-31589][INFRA] Use r-lib/actions/setup-r in GitHub Action
### What changes were proposed in this pull request?

This PR aims to use `r-lib/actions/setup-r` because it's more stable and maintained by a 3rd party.

### Why are the changes needed?

This will recover the current outage. In addition, this will be more robust in the future.
As of now, this is tested via https://github.com/dongjoon-hyun/spark/pull/17 .

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Actions, especially `Linter R` and `Generate Documents`.

Closes #28382 from dongjoon-hyun/SPARK-31589.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 13:22:43 +09:00
HyukjinKwon 98ec4a8ced [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller
### What changes were proposed in this pull request?

This PR excludes the `ui` directory and the `UI.scala` configuration file in the `CORE` label, and excludes `dev/.rat-excludes` in the `BUILD` label in the autolabeller. See https://github.com/apache/spark/pull/28218, https://github.com/apache/spark/pull/28217, https://github.com/apache/spark/pull/28214 and https://github.com/apache/spark/pull/28213

For more context, see https://github.com/apache/spark/pull/28114.

The syntax is from https://git-scm.com/docs/gitignore#_pattern_format (see also https://github.com/kaelzhang/node-ignore)

### Why are the changes needed?

To label UI component properly.

### Does this PR introduce any user-facing change?

No, dev-only.

### How was this patch tested?

It uses the same syntax used for other places. I expect to see the actual results after it gets merged as it's difficult to test it out.

Closes #28228 from HyukjinKwon/SPARK-31330-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-16 10:16:58 +09:00
HyukjinKwon c519fe1358 [SPARK-31330][INFRA][FOLLOW-UP] Move sbin and some files into appropriate categories in autolabeller
### What changes were proposed in this pull request?

This PR is a followup of 1b87015044. Now, we automatically label PRs, and it seems to be working fine.

This PR proposes to correct some minor items in the lists and categories.

**1.** Move `sbin` from `CORE` into `DEPLOY` components.

```
$ ls sbin

decommission-slave.sh          start-all.sh                   start-slave.sh                 stop-master.sh                 stop-thriftserver.sh
slaves.sh                      start-history-server.sh        start-slaves.sh                stop-mesos-dispatcher.sh
spark-config.sh                start-master.sh                start-thriftserver.sh          stop-mesos-shuffle-service.sh
spark-daemon.sh                start-mesos-dispatcher.sh      stop-all.sh                    stop-slave.sh
spark-daemons.sh               start-mesos-shuffle-service.sh stop-history-server.sh         stop-slaves.sh
```

**2.**

`/sbin/*mesos*.sh` -> `MESOS`
`/bin/spark-shell*` -> `SPARK SHELL`.
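
As a rough illustration of the `/sbin/*mesos*.sh` rule above, the sketch below checks which of the scripts from the `ls sbin` listing would match. It uses Python's `fnmatch` as a stand-in for the gitignore-style matching; that is an approximation that happens to be adequate for this simple pattern, not the matcher the bot actually uses. The leading `/` (which anchors the pattern at the repository root) is dropped because the paths here are already relative to the root.

```python
from fnmatch import fnmatchcase

# File names copied from the `ls sbin` output above.
SBIN_FILES = [
    "decommission-slave.sh", "start-all.sh", "start-slave.sh",
    "stop-master.sh", "stop-thriftserver.sh", "slaves.sh",
    "start-history-server.sh", "start-slaves.sh", "stop-mesos-dispatcher.sh",
    "spark-config.sh", "start-master.sh", "start-thriftserver.sh",
    "stop-mesos-shuffle-service.sh", "spark-daemon.sh",
    "start-mesos-dispatcher.sh", "stop-all.sh", "stop-slave.sh",
    "spark-daemons.sh", "start-mesos-shuffle-service.sh",
    "stop-history-server.sh", "stop-slaves.sh",
]

# Pattern from the rule above, without the root-anchoring leading slash.
MESOS_PATTERN = "sbin/*mesos*.sh"

# Paths relative to the repository root that the pattern selects.
mesos_scripts = sorted(
    f"sbin/{name}" for name in SBIN_FILES
    if fnmatchcase(f"sbin/{name}", MESOS_PATTERN)
)

for path in mesos_scripts:
    print(path, "-> MESOS")
```

Only the four `*-mesos-*` scripts are selected; the remaining `sbin` scripts fall under `DEPLOY`.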

### Why are the changes needed?

To label PRs correctly so that devs can take advantage of the labels, for example to check the PRs of a specific component.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It has not been tested yet. It can be tested after it is merged.

Closes #28201 from HyukjinKwon/SPARK-31330.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-13 18:48:41 +09:00
Nicholas Chammas 1b87015044 [SPARK-31330] Automatically label PRs based on the paths they touch
### What changes were proposed in this pull request?

This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.

### Why are the changes needed?

This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

We'll only be able to test it, I believe, after merging it in. Given that [the Avro project is using this same bot already](https://github.com/apache/avro/blob/master/.github/autolabeler.yml), I expect it will be straightforward to get this working.

Closes #28114 from nchammas/SPARK-31330-auto-label-prs.

Lead-authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-13 10:01:31 +09:00
Dongjoon Hyun 2b744fe885 [SPARK-30963][INFRA] Add GitHub Action job for document generation
### What changes were proposed in this pull request?

This PR aims to add a new `GitHub Action` job for document generation.

### Why are the changes needed?

We had better test the document generation in PR Builder.
- https://lists.apache.org/thread.html/rd06a2154e853812652b8f7fa3c003746ed531b213c531517f055e1dc%40%3Cdev.spark.apache.org%3E

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action in this PR.

Closes #27715 from dongjoon-hyun/SPARK-30963.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-26 19:24:41 -08:00
Takeshi Yamamuro 29b3e42779 [MINOR] Update the PR template for adding a link to the configuration naming guideline
### What changes were proposed in this pull request?

This is a follow-up of #27577. This pr intends to add a link to the configuration naming guideline in `.github/PULL_REQUEST_TEMPLATE`.

### Why are the changes needed?

For reminding developers to follow the naming rules.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27602 from maropu/pr27577-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-17 16:05:08 +09:00
HyukjinKwon cd9ccdc0ac [SPARK-30601][BUILD] Add a Google Maven Central as a primary repository
### What changes were proposed in this pull request?

This PR proposes to address four things. Three issues and fixes were a bit mixed so this PR sorts it out. See also http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html for the discussion in the mailing list.

1. Add the Google Maven Central mirror (GCS) as a primary repository. This will not only make development more stable but also make the Github Actions build (where jars always need to be downloaded) more stable. The Jenkins PR builder would not be affected much, as it uses the pre-downloaded jars under `.m2`.

    - Google Maven Central seems stable for heavy workload but not synced very quickly (e.g., new release is missing)
    - Maven Central (default) seems less stable but synced quickly.

    We already added this GCS mirror as a default additional remote repository at SPARK-29175. So I don't see an issue with adding it as a repo.
    abf759a91e/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (L2111-L2118)

2. Currently, we have the hard-coded repository in [`sbt-pom-reader`](https://github.com/JoshRosen/sbt-pom-reader/blob/v1.0.0-spark/src/main/scala/com/typesafe/sbt/pom/MavenPomResolver.scala#L32) and this seems to overwrite Maven's existing resolver with the same ID `central` using `http://` when the pom file is initially ported into the SBT instance. This uses `http://`, which Maven Central recently disallowed (see https://github.com/apache/spark/pull/27242).

    My speculation is that we just need to be able to load the plugin and let it convert the POM to an SBT instance with another fallback repo. After that, it _seems_ to use `central` with `https` properly. See also https://github.com/apache/spark/pull/27307#issuecomment-576720395.

    I double checked that we use `https` properly from the SBT build as well:

    ```
    [debug] downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom ...
    [debug] 	public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom
    [debug] 	public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom.sha1
    ```

    This was fixed by adding the same repo (https://github.com/apache/spark/pull/27281), `central_without_mirror`, which is a bit awkward. Instead, this PR adds GCS as a main repo, and community Maven central as a fallback repo. So, presumably the community Maven central repo is used when the plugin is loaded as a fallback.

3. While I am here, I fix another issue. The Github Action at https://github.com/apache/spark/pull/27279 is failing. The reason seems to be that scalafmt 1.0.3 is in Maven Central but not in GCS.

    ```
    org.apache.maven.plugin.PluginResolutionException: Plugin org.antipathy:mvn-scalafmt_2.12:1.0.3 or one of its dependencies could not be resolved: Could not find artifact org.antipathy:mvn-scalafmt_2.12:jar:1.0.3 in google-maven-central (https://maven-central.storage-download.googleapis.com/repos/central/data/)
        at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve     (DefaultPluginDependenciesResolver.java:131)
    ```

   `mvn-scalafmt` exists in Maven central:

    ```bash
    $ curl https://repo.maven.apache.org/maven2/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom
    ```

    ```xml
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        ...
    ```

    but not in the GCS mirror:

    ```bash
    $ curl https://maven-central.storage-download.googleapis.com/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom
    ```
    ```xml
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: maven-central/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom</Details></Error>%
    ```

    This PR simply makes both repos accessible by adding them to `pluginRepositories`.

4. Remove the workarounds in Github Actions to switch mirrors because now we have same repos in the same order (Google Maven Central first, and Maven Central second)

### Why are the changes needed?

To make the build and Github Action more stable.

### Does this PR introduce any user-facing change?

No, dev only change.

### How was this patch tested?

I roughly checked locally and with PRs against my fork (https://github.com/HyukjinKwon/spark/pull/2 and https://github.com/HyukjinKwon/spark/pull/3).

Closes #27307 from HyukjinKwon/SPARK-30572.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-23 16:00:21 +09:00
Dongjoon Hyun c992716a33 [SPARK-30572][BUILD] Add a fallback Maven repository
### What changes were proposed in this pull request?

This PR aims to add a fallback Maven repository for when a mirror of `central` fails.

### Why are the changes needed?

We use `Google Maven Central` in GitHub Action as a mirror of `central`.
However, `Google Maven Central` sometimes doesn't have newly published artifacts
and there is no guarantee when we get the newly published artifacts.

By duplicating `Maven Central` with a new ID, we can add a fallback Maven repository
which is not mirrored by `Google Maven Central`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually testing with the new `Twitter` chill artifacts by switching `chill.version` from `0.9.3` to `0.9.5`.

```
$ rm -rf ~/.m2/repository/com/twitter/chill*
$ mvn compile | grep chill
Downloading from google-maven-central: https://maven-central.storage-download.googleapis.com/repos/central/data/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom
Downloading from central_without_mirror: https://repo.maven.apache.org/maven2/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom
Downloaded from central_without_mirror: https://repo.maven.apache.org/maven2/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom (2.8 kB at 11 kB/s)
```

Closes #27281 from dongjoon-hyun/SPARK-30572.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-19 17:42:34 -08:00
Nicholas Chammas f399d655c4 [SPARK-30173] Tweak stale PR message
Follow-on to #26877.

### What changes were proposed in this pull request?

This PR tweaks the stale PR message to [clarify](https://github.com/apache/spark/pull/24457#issuecomment-571393900) the procedure for reopening a PR after it has been marked as stale.

### Why are the changes needed?

This change should clarify the reopening process for contributors.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27114 from nchammas/SPARK-30173-stale-tweaks.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-07 08:34:59 -06:00
Nicholas Chammas 58b29392f8 [SPARK-30173][INFRA] Automatically close stale PRs
### What changes were proposed in this pull request?

This PR adds [a GitHub workflow to automatically close stale PRs](https://github.com/marketplace/actions/close-stale-issues).

### Why are the changes needed?

This will help cut down the number of open but stale PRs and keep the PR queue manageable.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I'm not sure how to test this PR without impacting real PRs on the repo.

See: https://github.com/actions/stale/issues/32

Closes #26877 from nchammas/SPARK-30173-stale-prs.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-15 08:42:16 -06:00
Dongjoon Hyun 16f1b23d75 [SPARK-30163][INFRA][FOLLOWUP] Make .m2 directory for cold start without cache
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/26793 and aims to initialize `~/.m2` directory.

### Why are the changes needed?

In case of cache reset, `~/.m2` directory doesn't exist. It causes a failure.
- `master` branch has a cache as of now. So, we missed this.
- `branch-2.4` has no cache as of now, and we hit this failure.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR is tested against personal `branch-2.4`.
- https://github.com/dongjoon-hyun/spark/pull/12

Closes #26794 from dongjoon-hyun/SPARK-30163-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-07 12:58:00 -08:00
Dongjoon Hyun 1068b8b249 [SPARK-30163][INFRA] Use Google Maven mirror in GitHub Action
### What changes were proposed in this pull request?

This PR aims to use [Google Maven mirror](https://cloudplatform.googleblog.com/2015/11/faster-builds-for-Java-developers-with-Maven-Central-mirror.html) in `GitHub Action` jobs to improve the stability.

```xml
<settings>
  <mirrors>
    <mirror>
      <id>google-maven-central</id>
      <name>GCS Maven Central mirror</name>
      <url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```

### Why are the changes needed?

Although we added a Maven cache inside `GitHub Action`, timeouts happen too frequently while accessing the `artifact descriptor`.

```
[ERROR] Failed to execute goal on project spark-mllib_2.12:
...
Failed to read artifact descriptor for ...
...
Connection timed out (Read failed) -> [Help 1]
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR is irrelevant to Jenkins.

This is tested on the personal repository first. `GitHub Action` of this PR should pass.
- https://github.com/dongjoon-hyun/spark/pull/11

Closes #26793 from dongjoon-hyun/SPARK-30163.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-07 12:04:10 -08:00
Dongjoon Hyun 81996f9e4d [SPARK-30152][INFRA] Enable Hadoop-2.7/JDK11 build at GitHub Action
### What changes were proposed in this pull request?

This PR enables JDK11 build with `hadoop-2.7` profile at `GitHub Action`.

**BEFORE (6 jobs including one JDK11 job)**
![before](https://user-images.githubusercontent.com/9700541/70342731-7763f300-180a-11ea-859f-69038b88451f.png)

**AFTER (7 jobs including two JDK11 jobs)**
![after](https://user-images.githubusercontent.com/9700541/70342658-54d1da00-180a-11ea-9fba-507fc087dc62.png)

### Why are the changes needed?

SPARK-29957 makes JDK11 test work with `hadoop-2.7` profile. We need to protect it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is `GitHub Action` only PR. See the result of `GitHub Action` on this PR.

Closes #26782 from dongjoon-hyun/SPARK-GHA-HADOOP-2.7.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 12:01:36 -08:00
Dongjoon Hyun cb68e58f88 [MINOR][INFRA] Use GitHub Action Cache for build
### What changes were proposed in this pull request?

This PR adds `GitHub Action Cache` task on `build` directory.

### Why are the changes needed?

This will replace the Maven downloading with the cache.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually check the GitHub Action log of this PR.

Closes #26652 from dongjoon-hyun/SPARK-MAVEN-CACHE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-24 12:35:57 -08:00
Dongjoon Hyun c98e5eb339 [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
### What changes were proposed in this pull request?

This PR aims to do the following.
- Add two profiles, `hive-1.2` and `hive-2.3` (default)
- Validate that we keep at least the existing combinations (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3).

For now, we assume that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The following are beyond the scope of this PR.

- SPARK-29988 Adjust Jenkins jobs for `hive-1.2/2.3` combination
- SPARK-29989 Update release-script for `hive-1.2/2.3` combination
- SPARK-29991 Support `hive-1.2/2.3` in PR Builder

### Why are the changes needed?

This will help to switch our dependencies to update the exposed dependencies.

### Does this PR introduce any user-facing change?

This is a dev-only change that the build profile combinations are changed.
- `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2`
- `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3`

### How was this patch tested?

Pass Jenkins with the dependency check and tests to make sure we don't change anything for now.

- [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)
- [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)

Also, from now, GitHub Action validates the following combinations.
![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png)

Closes #26619 from dongjoon-hyun/SPARK-29981.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-23 10:02:22 -08:00
Dongjoon Hyun affaefe1f3 [MINOR][INFRA] Add io and net to GitHub Action Cache
### What changes were proposed in this pull request?

This PR aims to cache `~/.m2/repository/net` and `~/.m2/repository/io` to reduce the flakiness.

### Why are the changes needed?

This will stabilize GitHub Action more before adding `hive-1.2` and `hive-2.3` combination.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

After the GitHub Action on this PR passes, check the log.

Closes #26621 from dongjoon-hyun/SPARK-GHA-CACHE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 15:43:57 +09:00
Liang-Chi Hsieh e753aa30e6 [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG
### What changes were proposed in this pull request?

Linter (R) github workflows failed sometimes like:

https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016

Failed message:
```
Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' failed: IPC connect call failed
gpg: keyserver receive failed: No dirmngr
##[error]Process completed with exit code 2.
```

It is due to a buggy GnuPG. Context:
https://github.com/sbt/website/pull/825
https://github.com/sbt/sbt/issues/4261
https://github.com/microsoft/WSL/issues/3286

### Why are the changes needed?

Make lint-r github workflows work.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass github workflows.

Closes #26602 from viirya/SPARK-29964.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-19 15:56:50 -08:00
Dongjoon Hyun 42f8f79ff0 [SPARK-29936][R] Fix SparkR lint errors and add lint-r GitHub Action
### What changes were proposed in this pull request?

This PR fixes SparkR lint errors and adds `lint-r` GitHub Action to protect the branch.

### Why are the changes needed?

It turns out that we currently don't run it. It was recovered yesterday. However, since then, our Jenkins linter jobs (`master`/`branch-2.4`) have been broken on `lint-r` tasks.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action on this PR in addition to Jenkins R and AppVeyor R.

Closes #26564 from dongjoon-hyun/SPARK-29936.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-17 21:01:01 -08:00
Dongjoon Hyun 4b71ad6ffb [SPARK-29820][INFRA] Use GitHub Action Cache for ./.m2/repository/[com|org]
### What changes were proposed in this pull request?

This PR aims to enable [GitHub Action Cache on Maven local repository](https://github.com/actions/cache/blob/master/examples.md#java---maven) for the following goals.
1. To reduce the chance of failure due to the Maven download flakiness.
2. To speed up the build a little bit.

Unfortunately, due to the GitHub Action Cache limitation, it seems that we cannot put all into a single cache. It's ignored like the following.
- **.m2/repository is 680777194 bytes**
```
/bin/tar -cz -f /home/runner/work/_temp/01f162c3-0c78-4772-b3de-b619bb5d7721/cache.tgz -C /home/runner/.m2/repository .
3
##[warning]Cache size of 680777194 bytes is over the 400MB limit, not saving cache.
```
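
To make the size constraint concrete, here is a minimal sketch that measures each top-level subdirectory of a local Maven repository and reports whether it would fit under the limit. The helper names (`dir_size`, `cacheable_subdirs`) are hypothetical, and the limit is assumed to be 400 MiB based on the warning above; the exact byte threshold that GitHub applies is not stated in the log.

```python
import os

# Assumption: "400MB" in the warning is interpreted as 400 MiB.
CACHE_LIMIT_BYTES = 400 * 1024 * 1024


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def cacheable_subdirs(repo_root, limit=CACHE_LIMIT_BYTES):
    """Map each top-level subdirectory to (size_in_bytes, fits_under_limit)."""
    result = {}
    for entry in sorted(os.listdir(repo_root)):
        full = os.path.join(repo_root, entry)
        if os.path.isdir(full):
            size = dir_size(full)
            result[entry] = (size, size <= limit)
    return result


if __name__ == "__main__":
    repo = os.path.expanduser("~/.m2/repository")
    if os.path.isdir(repo):
        for name, (size, fits) in cacheable_subdirs(repo).items():
            print(f"{name}: {size} bytes, fits={fits}")
```

The whole repository from the log (680777194 bytes) exceeds the limit under either reading of "400MB", which is why caching is done per subdirectory (`com`, `org`) instead of in one piece.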

### Why are the changes needed?

Not only to speed up the build but also to reduce the Maven download flakiness, we had better enable caching on the local Maven repository. The following are recent failure examples.
- https://github.com/apache/spark/runs/295869450

```
[ERROR] Failed to execute goal on project spark-streaming-kafka-0-10_2.12:
Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:spark-367716:
Could not transfer artifact com.fasterxml.jackson.datatype:jackson-datatype-jdk8:jar:2.10.0
from/to central (https://repo.maven.apache.org/maven2):
Failed to transfer file https://repo.maven.apache.org/maven2/com/fasterxml/jackson/datatype/
jackson-datatype-jdk8/2.10.0/jackson-datatype-jdk8-2.10.0.jar with status code 503 -> [Help 1]
...
[ERROR]   mvn <args> -rf :spark-streaming-kafka-0-10_2.12
```

```
[ERROR] Failed to execute goal on project spark-tools_2.12:
Could not resolve dependencies for project org.apache.spark:spark-tools_2.12:jar:3.0.0-SNAPSHOT:
Failed to collect dependencies at org.clapper:classutil_2.12:jar:1.5.1:
Failed to read artifact descriptor for org.clapper:classutil_2.12:jar:1.5.1:
Could not transfer artifact org.clapper:classutil_2.12:pom:1.5.1 from/to central (https://repo.maven.apache.org/maven2):
Connection timed out (Read failed) -> [Help 1]
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually check the GitHub Action log of this PR.

```
Cache restored from key: 1.8-hadoop-2.7-maven-com-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4
Cache restored from key: 1.8-hadoop-2.7-maven-org-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4
```
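
The restored keys above appear to combine the Java version, the Hadoop profile, the Maven group directory, and a 64-hex-digit hash. The sketch below shows one way such a key could be composed, assuming the trailing hash is a SHA-256 digest over build files such as `pom.xml`. The function `cache_key` is hypothetical, and GitHub's own file hashing may combine files differently, so the digests will not match real keys.

```python
import hashlib


def cache_key(java_version, hadoop_profile, group, build_file_contents):
    """Compose a cache key in the shape seen in the log above.

    build_file_contents is an iterable of bytes objects, for example the
    contents of each pom.xml (an assumption about what is hashed).
    """
    digest = hashlib.sha256()
    for content in build_file_contents:
        digest.update(content)
    return f"{java_version}-{hadoop_profile}-maven-{group}-{digest.hexdigest()}"


if __name__ == "__main__":
    poms = [b"<project>example</project>"]
    # Both keys share the same hash suffix and differ only in the group part.
    print(cache_key("1.8", "hadoop-2.7", "com", poms))
    print(cache_key("1.8", "hadoop-2.7", "org", poms))
```

Because the hash depends only on the build files, any change to them yields a new key and therefore a cache miss, while the `com` and `org` caches stay in sync with each other.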

Closes #26456 from dongjoon-hyun/SPARK-29820.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-10 11:02:54 -08:00
Yuming Wang e5c176a243 [MINOR][INFRA] Change the Github Actions build command to mvn install
### What changes were proposed in this pull request?

This PR changes the Github Actions build command from `mvn package` to `mvn install` to build Scaladoc jars.

### Why are the changes needed?

Sometimes the `mvn install` build fails with the error: `not found: type ClassName...`.
More details:  https://github.com/apache/spark/pull/24628#issuecomment-495655747

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #26414 from wangyum/github-action-install.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-06 09:16:50 -08:00
Dongjoon Hyun 3e2649287d [SPARK-29199][INFRA] Add linters and license/dependency checkers to GitHub Action
### What changes were proposed in this pull request?

This PR aims to add linters and license/dependency checkers to GitHub Action. This excludes `lint-r` intentionally because https://github.com/actions/setup-r is not ready. We can add that later when it becomes available.

### Why are the changes needed?

This will help the PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

See the GitHub Action result on this PR.

Closes #25879 from dongjoon-hyun/SPARK-29199.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-21 08:13:00 -07:00
Yuming Wang 9e234a5434 [MINOR][INFRA] Use java-version instead of version for GitHub Action
### What changes were proposed in this pull request?

This PR uses `java-version` instead of `version` for GitHub Action. More details:
204b974cf4
ac25aeee3a

### Why are the changes needed?

The `version` property will not be supported after October 1, 2019.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25866 from wangyum/java-version.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-20 08:54:34 -07:00
Dongjoon Hyun 3bf43fb60d [SPARK-29159][BUILD] Increase ReservedCodeCacheSize to 1G
### What changes were proposed in this pull request?

This PR aims to increase the JVM CodeCacheSize from 0.5G to 1G.

### Why are the changes needed?

After upgrading to `Scala 2.12.10`, the following is observed during building.
```
2019-09-18T20:49:23.5030586Z OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
2019-09-18T20:49:23.5032920Z OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
2019-09-18T20:49:23.5034959Z CodeCache: size=524288Kb used=521399Kb max_used=521423Kb free=2888Kb
2019-09-18T20:49:23.5035472Z  bounds [0x00007fa62c000000, 0x00007fa64c000000, 0x00007fa64c000000]
2019-09-18T20:49:23.5035781Z  total_blobs=156549 nmethods=155863 adapters=592
2019-09-18T20:49:23.5036090Z  compilation: disabled (not enough contiguous free space left)
```
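
For reference, the figures in the log above can be checked with a few lines of arithmetic. The values are copied from the `CodeCache:` line (reported in KB); the variable names are only illustrative.

```python
# Values copied from the "CodeCache:" log line above, in KB.
size_kb = 524288
used_kb = 521399

# 524288 KB is exactly 512 MB, i.e. the previous 0.5G setting.
size_mb = size_kb / 1024

# Fraction of the code cache already in use when the compiler was disabled.
used_ratio = used_kb / size_kb

# The new setting of 1G, expressed in KB, and the headroom it would leave
# at the same level of usage.
new_size_kb = 1024 * 1024
headroom_kb = new_size_kb - used_kb

print(f"old size: {size_mb:.0f} MB, used: {used_ratio:.1%}")
print(f"headroom at 1G: {headroom_kb} KB")
```

The cache was more than 99% full at 0.5G, so doubling it to 1G leaves roughly half of the cache free at the same usage level.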

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually check the Jenkins or GitHub Action build log (which should not have the above).

Closes #25836 from dongjoon-hyun/SPARK-CODE-CACHE-1G.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-19 00:24:15 -07:00
Dongjoon Hyun 197732e1f4 [SPARK-29125][INFRA] Add Hadoop 2.7 combination to GitHub Action
### What changes were proposed in this pull request?

Until now, we have been testing JDK8/11 with Hadoop-3.2. This PR aims to extend the test coverage to JDK8/Hadoop-2.7.

### Why are the changes needed?

This will prevent Hadoop 2.7 compile/package issues at PR stage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

GitHub Action on this PR shows all three combinations now. And, this is irrelevant to Jenkins test.

Closes #25824 from dongjoon-hyun/SPARK-29125.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-17 16:53:21 -07:00
Dongjoon Hyun 703fb2b054 [SPARK-29079][INFRA] Enable GitHub Action on PR
### What changes were proposed in this pull request?

This PR enables GitHub Action on PRs.

### Why are the changes needed?

So far, we have detected JDK11 compilation errors only after merging.
This PR aims to prevent JDK11 compilation errors at the PR stage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual. See the GitHub Action on this PR.

Closes #25786 from dongjoon-hyun/SPARK-29079.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-09-13 21:50:06 +00:00
Liang-Chi Hsieh 2db45cbd5a [SPARK-28920][INFRA] Set up java version for github workflow
This patch adds java version parameter to GitHub workflow conf for JDK8/11.

As we want to build with JDK8/11 on GitHub workflow, we need to add the java version according to the current matrix.

No

See the GitHub workflow run result.

Closes #25625 from viirya/github-workflow-java.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 20:55:14 -07:00
Dongjoon Hyun 780aa71749 [SPARK-28919][INFRA] Add more profiles for JDK8/11 build test for Github workflow
### What changes were proposed in this pull request?

This PR aims to add `-Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-3.2 -Phadoop-cloud` profiles to GitHub workflow conf.

### Why are the changes needed?

Currently, we build with JDK8 and test with JDK8/11 in Jenkins.
And, we use GitHub Workflow for JDK8/JDK11 building test.
To test JDK11 fully, we need to enable `hive` and `hadoop-3.2` profiles for `Hive 2.3.6` and `Hadoop 3.2`. Also, this PR adds all resource manager modules, too.

### Does this PR introduce any user-facing change?

No. In addition, Jenkins workload will be the same because this is specific to GitHub workflow.

### How was this patch tested?

See the GitHub workflow run result.

Closes #25624 from dongjoon-hyun/SPARK-JDK11-HIVE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 19:46:21 -07:00
HyukjinKwon 0ea8db9fd3 [SPARK-28578][INFRA] Improve Github pull request template
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->

This PR proposes to improve the Github template for better and faster review iterations and better interactions between PR authors and reviewers.

As suggested in the [dev mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-New-sections-in-Github-Pull-Request-description-template-td27527.html), this PR refers to [Kubernetes' PR template](https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md).

Therefore, the following fields are newly added:

```
### Why are the changes needed?
### Does this PR introduce any user-facing change?
```

and some comments were added.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->

Currently, many PR descriptions are poorly formatted, which creates overhead for both PR authors and reviewers.

Poorly formatted PR descriptions cause multiple problems:

- Some PRs still have a single-line description even though they change around 500 lines of code in a critical path.
- Some PRs do not describe behaviour changes, so reviewers need to find and document them.
- Some PRs are hard to review without an outline, yet the outline is sometimes missing.
- Spark is getting old, and sometimes we need to dig deep into its history. Because of poorly formatted PR descriptions, finding the root cause of a bug sometimes requires reading all of the code across the whole commit history.
- Reviews take a while, but the number of PRs keeps growing.

This PR aims to alleviate these problems.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->

Yes, it changes the PR template shown when a PR is opened. This PR itself uses the proposed template.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->

Manually tested via Github preview feature.

Closes #25310 from HyukjinKwon/SPARK-28578.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-16 09:45:15 +09:00
DB Tsai c2b40a76bb [SPARK-28719][BUILD][FOLLOWUP] Make Github Actions log quieter
## What changes were proposed in this pull request?

Make Github Actions log quieter

Closes #25468 from dbtsai/actions2.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-08-15 22:22:44 +00:00
DB Tsai 3302042ec4 [SPARK-28719][BUILD] [FOLLOWUP] Add JDK11 for Github Actions
## What changes were proposed in this pull request?

Add JDK11 for Github Actions

Closes #25444 from dbtsai/jdk11.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-08-14 03:14:07 +00:00
DB Tsai 601fd45814 [SPARK-28719][BUILD] Enable Github Actions for master
## What changes were proposed in this pull request?

GitHub now provides free CI/CD for build, test, and deploy. This PR enables a simple GitHub Actions workflow to build master with JDK8 on the latest Ubuntu. We can extend it to different versions of JDK, and even build Spark with Docker images in the future.

Closes #25440 from dbtsai/actions.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-08-13 22:55:02 +00:00
Sean Owen eed6de1a65 [MINOR][DOCS] Tighten up some key links to the project and download pages to use HTTPS
## What changes were proposed in this pull request?

Tighten up some key links to the project and download pages to use HTTPS

## How was this patch tested?

N/A

Closes #24665 from srowen/HTTPSURLs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 10:56:42 -07:00
Sean Owen 7e0cd1d9b1 [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site
## What changes were proposed in this pull request?

Updates links to the wiki to links to the new location of content on spark.apache.org.

## How was this patch tested?

Doc builds

Author: Sean Owen <sowen@cloudera.com>

Closes #15967 from srowen/SPARK-18073.1.
2016-11-23 11:25:47 +00:00
Sean Owen f8062b63fc [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE
## What changes were proposed in this pull request?

Link to contributing wiki in PR template, README.md

## How was this patch tested?

Doc-only change, tested by Jekyll

Author: Sean Owen <sowen@cloudera.com>

Closes #15429 from srowen/SPARK-17840.
2016-10-12 11:14:03 -07:00
Rahul Tanwani 831429170f [MINOR][MAINTENANCE] Fix typo for the pull request template.
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Rahul Tanwani <tanwanirahul@gmail.com>

Closes #11343 from tanwanirahul/pull_request_template.
2016-02-24 00:45:31 -08:00
Reynold Xin 892b2dd6dd Add github pull request template
2016-02-17 22:14:45 -05:00