81 commits

Author SHA1 Message Date
Enrico Minack 1d450250eb [BUILD][MINOR] Do not publish snapshots from forks
### What changes were proposed in this pull request?
The GitHub workflow `Publish Snapshot` publishes the master and 3.1 branches via Nexus. For this, the workflow uses the `secrets.NEXUS_USER` and `secrets.NEXUS_PW` secrets. These secrets are not available in forks, so the workflow fails there every day:

- https://github.com/G-Research/spark/actions/runs/431626797
- https://github.com/G-Research/spark/actions/runs/433153049
- https://github.com/G-Research/spark/actions/runs/434680048
- https://github.com/G-Research/spark/actions/runs/436958780

### Why are the changes needed?
Avoid attempting to publish snapshots from forked repositories.
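
As a rough illustration of such a guard, the sketch below checks the repository name before publishing. `GITHUB_REPOSITORY` is a default environment variable on GitHub Actions runners; the helper name `should_publish_snapshot` and the use of a Python check (rather than a workflow-level condition) are assumptions for illustration, not the change made in this PR.

```python
import os

# Assumption: snapshots should only ever be published from the upstream repository.
UPSTREAM_REPOSITORY = "apache/spark"


def should_publish_snapshot(env=None):
    """Return True only when running in the upstream repository (hypothetical helper)."""
    env = os.environ if env is None else env
    # GitHub Actions sets GITHUB_REPOSITORY to "<owner>/<repo>" for the running workflow.
    return env.get("GITHUB_REPOSITORY") == UPSTREAM_REPOSITORY
```

In a fork such as `G-Research/spark`, the check returns `False`, so the publishing step can be skipped before the missing secrets are ever needed.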

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Code review only.

Closes #30884 from EnricoMi/branch-do-not-publish-snapshots-from-forks.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 00:22:42 +09:00
Kousuke Saruta b0da2bcd46 [MINOR][INFRA] Add -Pspark-ganglia-lgpl to the build definition with Scala 2.13 on GitHub Actions
### What changes were proposed in this pull request?

This PR adds `-Pspark-ganglia-lgpl` to the build definition with Scala 2.13 on GitHub Actions.

### Why are the changes needed?

Keep the code buildable with Scala 2.13.
With this change, all the sub-modules seem to be buildable with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that the Scala 2.13 build passes with the following commands.
```
$ ./dev/change-scala-version.sh 2.13
$ build/sbt -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile
```

Closes #30834 from sarutak/ganglia-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-18 15:10:13 +09:00
Kousuke Saruta b135db3b1a [SPARK-33757][INFRA][R][FOLLOWUP] Provide more simple solution
### What changes were proposed in this pull request?

This PR proposes a better solution for the R build failure on GitHub Actions.
The issue was solved in #30737, but I noticed the following two things.

* We can use the latest `usethis` if we install additional libraries on the GitHub Actions environment.
* For tests on AppVeyor, `usethis` is not necessary, so I partially revert the previous change.

### Why are the changes needed?

To provide a simpler solution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed on GitHub Actions and AppVeyor on my account.

Closes #30753 from sarutak/followup-SPARK-33757.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-13 17:27:39 -08:00
Kousuke Saruta fb2e3af4b5 [SPARK-33757][INFRA][R] Fix the R dependencies build error on GitHub Actions and AppVeyor
### What changes were proposed in this pull request?

This PR fixes the R dependencies build error on GitHub Actions and AppVeyor.
The cause seems to be that the `usethis` package was updated on 2020/12/10.
https://cran.r-project.org/web/packages/usethis/index.html

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.

Closes #30737 from sarutak/fix-r-dependencies-build-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-12 00:54:40 +09:00
Kousuke Saruta 29cc5b3f23 [MINOR][INFRA] Add kubernetes-integration-tests to GitHub Actions for Scala 2.13 build
### What changes were proposed in this pull request?

This PR adds `kubernetes-integration-tests` to GitHub Actions for Scala 2.13 build.

### Why are the changes needed?

Now that the build passes with `kubernetes-integration-tests` and Scala 2.13, it's better to keep it buildable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.
I also confirmed that the build passes with the following command.
```
$ build/sbt -Pscala-2.13 -Pkubernetes -Pkubernetes-integration-tests compile test:compile
```

Closes #30731 from sarutak/github-actions-k8s.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-12 00:53:31 +09:00
Dongjoon Hyun c001dd49e4 [SPARK-33675][INFRA][FOLLOWUP] Schedule branch-3.1 snapshot at master branch
### What changes were proposed in this pull request?

Snapshot publishing for `master`/`branch-3.0`/`branch-2.4` has been successfully migrated from Jenkins to `GitHub Action`.

- https://github.com/apache/spark/actions?query=workflow%3A%22Publish+Snapshot%22

This PR aims to schedule `branch-3.1` snapshot at `master` branch.

### Why are the changes needed?

This is because it turns out that `GitHub Action Schedule` works only on the `master` branch (the default branch).
- https://docs.github.com/en/free-pro-teamlatest/actions/reference/events-that-trigger-workflows#scheduled-events

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The matrix triggering is tested at the forked branch.
- https://github.com/dongjoon-hyun/spark/runs/1519015974

Closes #30674 from dongjoon-hyun/SPARK-SCHEDULE-3.1.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-08 10:43:41 -08:00
Dongjoon Hyun 3a6546d385 [MINOR][INFRA] Add -Pdocker-integration-tests to GitHub Action Scala 2.13 build job
### What changes were proposed in this pull request?

This PR aims to add `-Pdocker-integration-tests` to the GitHub Action job for Scala 2.13 compilation.

### Why are the changes needed?

We fixed Scala 2.13 compilation of this module in https://github.com/apache/spark/pull/30660 . This PR will prevent accidental regressions in that module.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GitHub Action Scala 2.13 job.

Closes #30661 from dongjoon-hyun/SPARK-DOCKER-IT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-12-08 14:11:39 +09:00
Kousuke Saruta e88f0d4a24 [SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build in GitHub Actions
### What changes were proposed in this pull request?

This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job.

In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows.
```
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile
```

However, the Scala 2.13 build uses Java 8 rather than 11, so let's remove `-Djava.version=11`.

### Why are the changes needed?

To build with consistent configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions' workflow.

Closes #30633 from sarutak/scala-213-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 17:57:19 -08:00
Dongjoon Hyun e32de29bce [SPARK-33675][INFRA] Add GitHub Action job to publish snapshot
### What changes were proposed in this pull request?

This PR aims to add `GitHub Action` job to publish daily snapshot for **master** branch.
- https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.2.0-SNAPSHOT/

For the other branches, I'll make adjusted backports.
- For `branch-3.1`, we can specify the checkout `ref` to `branch-3.1`.
- For `branch-2.4` and `branch-3.0`, we can publish at every commit since the traffic is low.
  - https://github.com/apache/spark/pull/30630 (branch-3.0)
  - https://github.com/apache/spark/pull/30629 (branch-2.4 LTS)

### Why are the changes needed?

After this series of jobs, this will reduce our maintenance burden permanently from AmpLab Jenkins by removing the following completely.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/

For now, AmpLab Jenkins doesn't have a job for `branch-3.1`. We can do it ourselves with `GitHub Action`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The snapshot publishing is tested here at PR trigger. Since this PR adds a scheduled job, we cannot test in this PR.
- https://github.com/dongjoon-hyun/spark/runs/1505792859

Apache Infra team finished the setup here.
- https://issues.apache.org/jira/browse/INFRA-21167

Closes #30623 from dongjoon-hyun/SPARK-33675.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 10:05:28 +09:00
Dongjoon Hyun f94cb53a90 [MINOR][INFRA] Use the latest image for GitHub Action jobs
### What changes were proposed in this pull request?

Currently, GitHub Action is using two docker images.

```
$ git grep dongjoon/apache-spark-github-action-image
.github/workflows/build_and_test.yml:      image: dongjoon/apache-spark-github-action-image:20201015
.github/workflows/build_and_test.yml:      image: dongjoon/apache-spark-github-action-image:20201025
```

This PR aims to make it consistent by using the latest one.
```
- image: dongjoon/apache-spark-github-action-image:20201015
+ image: dongjoon/apache-spark-github-action-image:20201025
```

### Why are the changes needed?

This is for better maintainability. The image size is almost the same.
```
$ docker images | grep 202010
dongjoon/apache-spark-github-action-image                       20201025               37adfa3d226a   5 weeks ago     2.18GB
dongjoon/apache-spark-github-action-image                       20201015               ff6fee8dc36d   6 weeks ago     2.16GB
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #30578 from dongjoon-hyun/SPARK-MINOR.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 09:34:42 +09:00
HyukjinKwon fbfc0bf628 [SPARK-33464][INFRA] Add/remove (un)necessary cache and restructure GitHub Actions yaml
### What changes were proposed in this pull request?

This PR proposes:
- Add `~/.sbt` directory into the build cache, see also https://github.com/sbt/sbt/issues/3681
- Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see https://github.com/apache/spark/pull/30391#discussion_r524881430
- Remove unnecessary `.m2` cache if you run SBT tests only.
- Remove `rm ~/.m2/repository/org/apache/spark`. If you don't `sbt publishLocal` or `mvn install`, we don't need to care about it.
- Use Java 8 in Scala 2.13 build. We can switch the Java version to 11 used for release later.
- Add caches into linters. The linter scripts use `sbt` in, for example, `./dev/lint-scala`, and `mvn` in, for example, `./dev/lint-java`. Also, the Jekyll build needs to run `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and build tools.
- Use the same syntax of Java version, 1.8 -> 8.

### Why are the changes needed?

- Remove unnecessary stuff
- Cache what we can in the build

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It will be tested in GitHub Actions build at the current PR

Closes #30391 from HyukjinKwon/SPARK-33464.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-18 15:13:43 -08:00
Dongjoon Hyun 10105b555d [SPARK-33454][INFRA] Add GitHub Action job for Hadoop 2
### What changes were proposed in this pull request?

This PR aims to protect `Hadoop 2.x` profile compilation in Apache Spark 3.1+.

### Why are the changes needed?

Since Apache Spark 3.1+ switches the default profile to Hadoop 3, we should at least prevent compilation errors with the `Hadoop 2.x` profile during the PR review phase. Although this is an additional workload, it will finish quickly because it's compilation only.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.
- This should be merged after https://github.com/apache/spark/pull/30375 .

Closes #30378 from dongjoon-hyun/SPARK-33454.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 15:06:51 +09:00
Dongjoon Hyun a70a2b02ce [SPARK-33439][INFRA] Use SERIAL_SBT_TESTS=1 for SQL modules
### What changes were proposed in this pull request?

This PR aims to decrease the parallelism of the `SQL` module, as is already done for the `Hive` module.
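
A minimal sketch of the idea, assuming a hypothetical helper that prepares the environment for a test job: modules whose names start with one of `SERIAL_MODULE_PREFIXES` get `SERIAL_SBT_TESTS=1`. The helper name, the prefix tuple, and the dict-based approach are illustrative assumptions; the actual change lives in the workflow definition, which is not shown here.

```python
import os

# Assumption: module name prefixes whose SBT tests should run serially.
SERIAL_MODULE_PREFIXES = ("hive", "sql")


def build_test_env(modules, base_env=None):
    """Return a copy of the environment, adding SERIAL_SBT_TESTS=1 for serial modules."""
    env = dict(os.environ if base_env is None else base_env)
    # str.startswith accepts a tuple, so one call covers every prefix.
    if any(m.strip().startswith(SERIAL_MODULE_PREFIXES) for m in modules):
        env["SERIAL_SBT_TESTS"] = "1"
    return env
```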

### Why are the changes needed?

The GitHub Action `sql - slow tests` job has become flaky.
- https://github.com/apache/spark/runs/1393670291
- https://github.com/apache/spark/runs/1393088031

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only feature.
Although this will increase the running time, it's better than flakiness.

### How was this patch tested?

Pass the GitHub Action stably.

Closes #30365 from dongjoon-hyun/SPARK-33439.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 21:19:51 -08:00
Kousuke Saruta 208b94e4c1 [SPARK-33353][BUILD] Cache dependencies for Coursier with new sbt in GitHub Actions
### What changes were proposed in this pull request?

This PR changes the behavior of the GitHub Actions job that caches dependencies.
SPARK-33226 upgraded sbt to 1.4.1.
As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher.
So let's change the dependency cache configuration for the GitHub Actions job.

### Why are the changes needed?

To make build faster with Coursier for the GitHub Actions job.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions itself.

Closes #30259 from sarutak/coursier-cache.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 09:29:53 -08:00
Kyle Bendickson 0535b34ad4 [SPARK-33282] Migrate from deprecated probot autolabeler to GitHub labeler action
### What changes were proposed in this pull request?

This PR removes the old Probot Autolabeler labeling configuration, as the probot autolabeler has been deprecated. I've updated the configs in Iceberg and in Avro, and we also need to update it here. This PR adds in an additional workflow for labeling PRs and migrates the old probot config to the new format. Unfortunately, because certain features have not been released upstream, we will not get the _exact_ behavior as before. I have documented where that is and what changes are needed, and in the associated ticket I've also discussed other options and why I think this is the best way to go. A follow-up ticket is definitely needed to get the original behavior back in these few cases, but PRs have not been labeled for almost a month and so it's probably best to get it right 95% of the time and occasionally have some UI related PRs labeled as `CORE` while the issue is resolved upstream and/or further investigated.

### Why are the changes needed?

The probot autolabeler is dead and will not be maintained going forward. This has been confirmed with github user [at]mithro in an issue in their repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

To test this PR, I first merged the config into my local fork. I then edited it several times and ran tests on that.

Unfortunately, I've overwritten my fork with the apache repo in order to create a proper PR. However, I've also added the config for the same thing in the Iceberg repo as well as the Avro repo.

I have now merged this PR into my local repo and will be running some tests on edge cases there and for validating in general:
- [Check that the SQL label is applied for changes directly below repo root's sql directory](https://github.com/kbendick/spark/pull/16) 
- [Check that the structured streaming label is applied](https://github.com/kbendick/spark/pull/20) 
- [Check that a wildcard at the end of a pattern will match nested files](https://github.com/kbendick/spark/pull/19) 
- [Check that the rule **/*pom.xml will match the root pom.xml file](https://github.com/kbendick/spark/pull/25) 

I've also discovered that we're likely not killing GitHub Actions runs (like large tests etc) when users push to their PR. In most cases, I see that a user has to mark something as "OK to test", but it still seems like we might want to discuss whether or not we should add a cancellation step in order to save time / capacity on the runners. If so desired, we would add an action in each workflow that cancels old runs when a `push` action occurs on a PR. This will likely make waiting for test runners much faster iff tests are automatically rerun on push by anybody (such as PMCs, PRs that have been marked OK to test, etc). We could potentially free a large number of resources if a cancellation step was added to all of the workflows in the Apache account (as GitHub Action API limits are set at the account level).

Admittedly, the "old" workflow runs may not have been cancelled simply because I was working in a fork. However, the fact that there are explicit actions that can be added to the start of workflows to cancel old PR workflows, and that we don't have them configured, suggests to me that this is likely also the case in this repo (and in most `apache` repos as well), at least under certain circumstances (e.g. repos that don't have "Ok to test"-like webhooks).

This is a separate issue though, which I can bring up on the mailing list once I'm done with this PR. Unfortunately I've been very busy the past two weeks, but if somebody else wanted to work on that I would be happy to support with any knowledge I have.

The last Apache repo to still have the probot autolabeler in it is Beam, at which point we can have Gavin from ASF Infra remove the permissions for the probot autolabeler entirely. See the associated JIRA ticket for the links to other tickets, like the one for ASF Infra to remove the dead probot autolabeler's read and write permissions to our PRs in the Apache organization.

Closes #30244 from kbendick/begin-migration-to-github-labeler-action.

Authored-by: Kyle Bendickson <kjbendickson@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-05 16:10:52 +09:00
HyukjinKwon 9818f079aa [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency
### What changes were proposed in this pull request?

This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.

- **Before:**
    ...
    ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
    ...
    ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
    ...

- **After:**

    ...
    ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
    ...

See also https://numpydoc.readthedocs.io/en/latest/format.html
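
For readers unfamiliar with the format, here is a small function documented in NumPy style. The function `clip_value` is invented for illustration and is not part of PySpark.

```python
def clip_value(value, lower, upper):
    """
    Clamp a number to a closed interval.

    Parameters
    ----------
    value : float
        The number to clamp.
    lower : float
        Lower bound of the interval.
    upper : float
        Upper bound of the interval.

    Returns
    -------
    float
        `value` limited to the range ``[lower, upper]``.

    Raises
    ------
    ValueError
        If `lower` is greater than `upper`.

    Examples
    --------
    >>> clip_value(5.0, 0.0, 1.0)
    1.0
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(lower, min(value, upper))
```

The section headers with dashed underlines (`Parameters`, `Returns`, `Raises`, `Examples`) stay readable in plain text, for instance when viewed through `help(clip_value)`.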

### Why are the changes needed?

There are many reasons for switching to NumPy documentation style.

1. Arguably, reST style doesn't scale well as docstrings grow large because it provides less structure and syntax.

2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`.

3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas,
matplotlib, ... Using NumPy documentation style can give users a consistent documentation style.

### Does this PR introduce _any_ user-facing change?

The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.

### How was this patch tested?

Manually tested via running `cd python` and `make clean html`.

Closes #30149 from HyukjinKwon/SPARK-33243.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 14:03:57 +09:00
Dongjoon Hyun 850adeb0fd [SPARK-33239][INFRA] Use pre-built image at GitHub Action SparkR job
### What changes were proposed in this pull request?

This PR aims to use a pre-built image for Github Action SparkR job.

### Why are the changes needed?

This will reduce the execution time and the flakiness.

**BEFORE (21 minutes 39 seconds)**
![Screen Shot 2020-10-16 at 1 24 43 PM](https://user-images.githubusercontent.com/9700541/96305593-fbeada80-0fb2-11eb-9b8e-86d8abaad9ef.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action `sparkr` job in this PR.

Closes #30066 from dongjoon-hyun/SPARKR.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-26 01:50:23 -07:00
Bryan Cutler 47a6568265 [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested timestamps in pyarrow
### What changes were proposed in this pull request?

Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.

### Why are the changes needed?

The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.
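
A minimal sketch of how a test runner might pass the variable to child processes. `PYARROW_IGNORE_TIMEZONE` is the variable named in this PR; the helper name `pyarrow_legacy_env` and the value `"1"` are assumptions for illustration, not a copy of the change in `run-tests.py`.

```python
import os


def pyarrow_legacy_env(base_env=None):
    """Return a copy of the environment with the legacy timestamp flag set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: any non-empty value such as "1" enables the legacy behavior.
    env["PYARROW_IGNORE_TIMEZONE"] = "1"
    return env
```

The returned dict can be passed as the `env=` argument of `subprocess.run` or `subprocess.Popen`, so the setting applies only to the test processes and not to the parent shell.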

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 09:13:33 +09:00
HyukjinKwon eb9966b700 [SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in GitHub Actions
### What changes were proposed in this pull request?

A new PyArrow release was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:

```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
    .select('id', 'result').collect()
  File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
    sock_info = self._jdf.collectToPython()
  File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
    process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
    serializer.dump_stream(out_iter, outfile)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
    for batch in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
    for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
    return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
    result = f(key, pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
    return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
    "{} != {}".format(expected_key[i][1], window_range)
AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
```

https://github.com/apache/spark/runs/1278917457

This PR proposes to set the upper bound of PyArrow in GitHub Actions build. This should be removed when we properly support PyArrow 2.0.0+ (SPARK-33189).
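
The assertion failure above comes down to comparing a timezone-naive `datetime` with a timezone-aware one. The standalone sketch below reproduces that comparison with the standard library only; it uses `datetime.timezone.utc` in place of the `StaticTzInfo 'Etc/UTC'` object from the traceback, which is an assumption made to avoid extra dependencies.

```python
from datetime import datetime, timezone

# Expected value in the test: no timezone attached.
naive_start = datetime(2018, 3, 15, 0, 0)
# Value produced with the newer PyArrow behavior: timezone attached.
aware_start = datetime(2018, 3, 15, 0, 0, tzinfo=timezone.utc)

# In Python 3, a naive and an aware datetime never compare equal.
values_match = naive_start == aware_start

# Dropping the tzinfo makes the wall-clock values comparable again.
match_after_strip = naive_start == aware_start.replace(tzinfo=None)
```

Here `values_match` is `False` and `match_after_strip` is `True`, which mirrors why the test's dictionary comparison fails even though the wall-clock values are identical.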

### Why are the changes needed?

To make build pass.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions in this build will test it out.

Closes #30098 from HyukjinKwon/hot-fix-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 17:35:09 +09:00
Fokko Driesprong 6ad75cda1e [SPARK-17333][PYSPARK] Enable mypy
### What changes were proposed in this pull request?

Add MyPy to the CI. Once this is installed on the CI: https://issues.apache.org/jira/browse/SPARK-32797?jql=project%20%3D%20SPARK%20AND%20text%20~%20mypy this will automatically check the types.
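
As a small illustration of what such a check covers, consider the annotated function below. It is hypothetical and not taken from PySpark. A static type checker such as mypy is expected to report the commented-out call before the code is ever run, because a `str` is passed where an `int` is declared.

```python
from typing import List


def repeat_label(label: str, times: int) -> List[str]:
    """Return a list containing `label` repeated `times` times."""
    return [label] * times


# Matches the annotations, so a type checker accepts it.
labels = repeat_label("spark", 2)

# A type checker should flag the following call: "2" is a str, not an int.
# repeat_label("spark", "2")
```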

### Why are the changes needed?

We should check if the types are still correct on the CI.

```
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

The sphinx-build command was not found. Skipping Sphinx build for now.

all lint-python tests passed!
```

### Does this PR introduce _any_ user-facing change?

No :)

### How was this patch tested?

By running `./dev/lint-python` locally.

Closes #30088 from Fokko/SPARK-17333.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-19 12:50:01 -07:00
HyukjinKwon a7a8dae483 Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found"
This reverts commit a0aa8f33a9.
2020-10-19 17:13:47 +09:00
Dongjoon Hyun 9f5eff0ae1 [SPARK-33162][INFRA] Use pre-built image at GitHub Action PySpark jobs
### What changes were proposed in this pull request?

This PR aims to use `pre-built image` at Github Action PySpark jobs. To isolate the changes, `pyspark` jobs are split from the main job. The docker image is built by the following.

| Item | URL |
| --- | --- |
| Dockerfile | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile |
| Builder | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/.github/workflows/build.yml |
| Image Location | https://hub.docker.com/r/dongjoon/apache-spark-github-action-image |

Please note the following:
1. The community will still use `build_and_test.yml` to add new features, as we have done until now. The `Dockerfile` will be updated regularly.
2. When Apache Spark gets an official docker repository location, we will use it.
3. Also, it would be best to keep this Dockerfile and builder script in a new Apache Spark dev branch instead of in an outside GitHub repository.

### Why are the changes needed?

Currently, the two `pyspark` test jobs always take over one and a half hours each. In total, 3 hours 14 minutes.
- https://github.com/apache/spark/runs/1240470628 (1 hour 35 mins)
- https://github.com/apache/spark/runs/1240470634 (1 hour 39 mins)

This PR will remove the package installation steps, which take 16 minutes and cause flakiness. Note that `Python 3.6 package installation` is not included in the pre-built image and it only takes `20s`.

**BEFORE**
![Screen Shot 2020-10-15 at 10 32 17 AM](https://user-images.githubusercontent.com/9700541/96165634-be625080-0ed1-11eb-974b-940c112152e9.png)

**AFTER**
![Screen Shot 2020-10-15 at 10 58 17 AM](https://user-images.githubusercontent.com/9700541/96168262-5d3c7c00-0ed5-11eb-83c5-e9dc189a156b.png)

In short, the `pyspark` GitHub jobs take less time. In total, 2 hours 23 minutes (down from 3 hours 14 minutes previously).
- https://github.com/apache/spark/pull/30059/checks?check_run_id=1260512568 (1 hour 18 mins)
- https://github.com/apache/spark/pull/30059/checks?check_run_id=1260512582 (1 hour 5 mins)
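
The totals quoted above can be checked with a few lines of arithmetic. The durations are the ones listed in this PR; the variable names are illustrative.

```python
from datetime import timedelta

# Per-job durations before and after switching to the pre-built image.
before = [timedelta(hours=1, minutes=35), timedelta(hours=1, minutes=39)]
after = [timedelta(hours=1, minutes=18), timedelta(hours=1, minutes=5)]

total_before = sum(before, timedelta())  # 3 hours 14 minutes
total_after = sum(after, timedelta())    # 2 hours 23 minutes
saved = total_before - total_after       # 51 minutes
```

The combined saving across the two jobs works out to 51 minutes per run.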

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action on this PR without `package installation steps`.

Closes #30059 from dongjoon-hyun/SPARK-33162.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-10-15 17:58:58 -07:00
HyukjinKwon b089fe5376 [SPARK-32247][INFRA] Install and test scipy with PyPy in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to install `scipy` as well in PyPy. It will test several ML specific test cases in PyPy as well. For example, 31a16fbb40/python/pyspark/mllib/tests/test_linalg.py (L487)

It was not installed when GitHub Actions build was added because it failed to install for an unknown reason. Seems like it's fixed in the latest scipy.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR will test it out.

Closes #30054 from HyukjinKwon/SPARK-32247.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-15 09:08:14 -07:00
Kousuke Saruta 513b6f5af2 [SPARK-33079][TESTS] Replace the existing Maven job for Scala 2.13 in Github Actions with SBT job
### What changes were proposed in this pull request?

SPARK-32926 added a build test to GitHub Action for Scala 2.13 but it's only with Maven.
As SPARK-32873 reported, some compilation errors happen only with SBT, so I think we need to add another build test to GitHub Action for SBT.
Unfortunately, we don't have abundant resources for GitHub Actions so instead of just adding the new SBT job, let's replace the existing Maven job with the new SBT job for Scala 2.13.

### Why are the changes needed?

To ensure build test passes even with SBT for Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GitHub Actions' job.

Closes #29958 from sarutak/add-sbt-job-for-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-15 20:51:20 +09:00
Dongjoon Hyun e85ed8a14c [SPARK-33156][INFRA] Upgrade GithubAction image from 18.04 to 20.04
### What changes were proposed in this pull request?

This PR aims to upgrade `Github Action` runner image from `Ubuntu 18.04 (LTS)` to `Ubuntu 20.04 (LTS)`.

### Why are the changes needed?

`ubuntu-latest` in `GitHub Action` is still `Ubuntu 18.04 (LTS)`.
- https://github.com/actions/virtual-environments#available-environments

This upgrade will help Apache Spark 3.1+ preparation for vote and release on the latest OS.

This is tested here.
- https://github.com/dongjoon-hyun/spark/pull/36

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the `Github Action` in this PR.

Closes #30050 from dongjoon-hyun/ubuntu_20.04.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-15 02:24:49 -07:00
HyukjinKwon a0aa8f33a9 [SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found
### What changes were proposed in this pull request?

This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found.

Currently, we're running and skipping the tests dynamically. For example,
- if there are only changes in SparkR at the underlying commit, it runs only the SparkR tests, skips the other tests, and generates JUnit XML files for the SparkR test cases.
- if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.

When the test reporting ("Report test results") job is triggered after the main build ("Build and test") is finished, and no JUnit XML files are found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example.

This PR works around it by simply skipping the test report when no JUnit XML files are found.
Please see https://github.com/apache/spark/pull/29906#issuecomment-702525542 for more details.
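
The check itself can be small. A minimal sketch in Python of "skip when nothing was produced"; the glob pattern `**/target/test-reports/*.xml` and the function name `has_junit_reports` are illustrative assumptions, not taken from the workflow in this PR:

```python
import tempfile
from pathlib import Path


def has_junit_reports(root, pattern="**/target/test-reports/*.xml"):
    """Return True if at least one file under `root` matches `pattern`."""
    return any(Path(root).glob(pattern))


# Usage: decide whether the reporting step should run at all.
with tempfile.TemporaryDirectory() as workdir:
    print(has_junit_reports(workdir))  # nothing has been written yet
    report_dir = Path(workdir) / "core" / "target" / "test-reports"
    report_dir.mkdir(parents=True)
    (report_dir / "TEST-example.xml").write_text("<testsuite/>")
    print(has_junit_reports(workdir))  # one report now exists
```

A reporting job would run only when the function returns `True`, so a docs-only change no longer shows up as a failed report.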

### Why are the changes needed?

To avoid false alarms in test results.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my fork.

Positive case:

https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/288996327

Negative case:

https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/289000058

Closes #29946 from HyukjinKwon/test-junit-files.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 09:09:58 +09:00
HyukjinKwon b205be5ff6 [SPARK-33051][INFRA][R] Uses setup-r to install R in GitHub Actions build
### What changes were proposed in this pull request?

In SPARK-32493, the R installation was switched to a manual installation because setup-r was broken. This seems to be fixed upstream, so we should switch it back.

### Why are the changes needed?

To avoid maintaining the installation steps ourselves.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR should test it.

Closes #29931 from HyukjinKwon/recover-r-build.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-02 15:12:33 +09:00
Dongjoon Hyun a8442c2826 [SPARK-32926][TESTS] Add Scala 2.13 build test in GitHub Action
### What changes were proposed in this pull request?

The PR aims to add Scala 2.13 build test coverage into GitHub Action for Apache Spark 3.1.0.

### Why are the changes needed?

The branch is ready for Scala 2.13 and this will prevent any regression.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Pass the GitHub Action.

Closes #29793 from dongjoon-hyun/SPARK-32926.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 14:01:52 -07:00
HyukjinKwon b07e7429a6 [SPARK-32695][INFRA] Explicitly cache and hash 'build' directly in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc in GitHub Actions. Otherwise, it can end up overwriting the `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436

Previously, other files like `build/mvn` and `build/sbt` were also cached and overwritten, so any changes made to them were ignored.
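
To illustrate why hashing matters here, the following Python sketch derives a cache key from the contents of chosen files, so the key changes whenever one of them changes. It is only an analogy for the idea, not the workflow's actual configuration; `cache_key`, the `build` prefix, and the stand-in file are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix, files):
    """Build a cache key from the SHA-256 of the listed files' contents."""
    digest = hashlib.sha256()
    for path in sorted(Path(f) for f in files):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


# Usage: the key changes as soon as one of the hashed files changes.
with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "sbt"  # stand-in for a file such as build/sbt
    script.write_text("version 1")
    before = cache_key("build", [script])
    script.write_text("version 2")
    after = cache_key("build", [script])
    print(before != after)
```

With a key like this, a stale cache entry is not restored over files that have since been edited.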

### Why are the changes needed?

To make GitHub Actions build stable.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

The builds in this PR test it out.

Closes #29536 from HyukjinKwon/SPARK-32695.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:25:59 +09:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It is based on Docker images. Once somebody builds an image, other people can reuse it against a specific commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply the changes as they are, or update them slightly when I apply them to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
Takeshi Yamamuro 6dd37cbaac [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test triggers
### What changes were proposed in this pull request?

This PR proposes to add a `workflow_dispatch` entry in the GitHub Action script (`build_and_test.yml`). This update enables developers to run the Spark tests for a specific branch on their own local repository, so I think it might help to check whether all the tests pass before opening a new PR.

<img width="944" alt="Screen Shot 2020-08-21 at 16 28 41" src="https://user-images.githubusercontent.com/692303/90866249-96250c80-e3ce-11ea-8496-3dd6683e92ea.png">

### Why are the changes needed?

To reduce the pressure of GitHub Actions on the Spark repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #29504 from maropu/DispatchTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-21 21:23:41 +09:00
HyukjinKwon bfd8c34154 [SPARK-32645][INFRA] Upload unit-tests.log as an artifact
### What changes were proposed in this pull request?

This PR proposes to upload `target/unit-tests.log` as an artifact so that it can be downloaded here:
![Screen Shot 2020-08-18 at 2 23 18 PM](https://user-images.githubusercontent.com/6477701/90474095-789e3b80-e15f-11ea-87f8-e7da3df3c03e.png)

### Why are the changes needed?

Jenkins has this feature. It would be best to have the same dev functionality as Jenkins.
Also, note that this was pointed out at https://github.com/apache/spark/pull/29225#discussion_r471485011.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

https://github.com/apache/spark/actions/runs/213000777 should demonstrate it

Closes #29454 from HyukjinKwon/SPARK-32645.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 12:28:36 +09:00
HyukjinKwon d0dfe4986b [MINOR][INFRA] Rename master.yml to build_and_test.yml
### What changes were proposed in this pull request?

This PR renames `master.yml` to `build_and_test.yml` to indicate this is the workflow that builds and runs the tests.

### Why are the changes needed?

Just for readability. `master.yml` looks like the name of the branch (to me).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR will test it out.

Closes #29459 from HyukjinKwon/minor-rename.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-18 18:18:47 +08:00
HyukjinKwon 86852c57af [SPARK-32606][SPARK-32605][INFRA] Remove the forks of action-surefire-report and action-download-artifact in test_report.yml
### What changes were proposed in this pull request?

This PR proposes to remove the usage of my own forks and use the original plugins in GitHub Actions testing report.

SPARK-32357 introduced the GitHub Actions test reporting by leveraging two plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report)
 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact)

In order to make it work, I had to fork two repositories with custom fixes:
  - HyukjinKwon/action-surefire-report@c96094c
  - f86c565d52

The two custom fixes are thankfully merged at https://github.com/ScaCap/action-surefire-report/pull/14 and https://github.com/dawidd6/action-download-artifact/pull/24, and they released new ones to use at [ScaCap/action-surefire-report/commits/v1](https://github.com/ScaCap/action-surefire-report/commits/v1) and [dawidd6/action-download-artifact/commits/v2](https://github.com/dawidd6/action-download-artifact/commits/v2)  - thanks jmisur and dawidd6 again.

### Why are the changes needed?

To avoid relying on forks and code duplications.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Logically there is no diff. I tested it at https://github.com/HyukjinKwon/spark/runs/992824229 to be doubly sure.

NOTE that this PR cannot be tested here within the workflow triggered by this PR without merging the changes in `test_report.yml` into the master.

Closes #29449 from HyukjinKwon/SPARK-32606-SPARK-32605.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-17 11:17:50 -07:00
Hyukjin Kwon 5debde9401 [SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to report the failed and succeeded tests in GitHub Actions in order to improve the development velocity by leveraging [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report). See the example below:

![Screen Shot 2020-08-13 at 8 17 52 PM](https://user-images.githubusercontent.com/6477701/90128649-28f7f280-dda2-11ea-9211-e98e34332f6b.png)

Note that we cannot just use [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) in Apache Spark because PRs come from forked repositories, where GitHub secrets are unavailable for security reasons. This plugin and all similar plugins require a GitHub token with write access in order to post test results, but such a token is unavailable in PRs.

To work around this limitation, I took this approach:

1. In workflow A, run the tests and upload the JUnit XML test results. GitHub provides a way to upload and download files as artifacts.
2. GitHub introduced a new event type, [`workflow_run`](https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/), 10 days ago. By leveraging this, the completion of workflow A triggers another workflow B.
3. Workflow B runs in the main repo instead of the forked repo, and has the write access the plugin needs. In workflow B, it downloads the artifact uploaded from workflow A (from the forked repository).
4. Workflow B generates the test reports to post from the JUnit XML files.
5. Workflow B looks up the PR and posts the test reports.
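
As a rough illustration of step 4, the sketch below summarizes one JUnit XML document in Python. It is not how action-surefire-report works internally; the function name `summarize_junit` and the sample document are made up for this example, and only the common `<testcase>`, `<failure>`, `<error>`, and `<skipped>` elements are assumed:

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count passed, failed, and skipped test cases in one JUnit XML document."""
    root = ET.fromstring(xml_text.strip())
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    # iter() walks the whole tree, so <testsuite> and <testsuites> roots both work.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


SAMPLE = """
<testsuite name="ExampleSuite">
  <testcase name="ok"/>
  <testcase name="broken"><failure message="boom"/></testcase>
  <testcase name="ignored"><skipped/></testcase>
</testsuite>
"""

print(summarize_junit(SAMPLE))
```

A report generator would aggregate such counts over every downloaded XML file before posting the result to the PR.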

The `workflow_run` event is a very new feature, and not many GitHub Actions plugins seem to support it yet. In order to make this work with [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report), I had to fork two GitHub Actions plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) to have this custom fix: c96094cc35
    It adds a `commit` argument to specify the commit to post the test reports to. With `workflow_run`, workflow B can access the commit from workflow A.

 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) to have this custom fix: 750b71af35
    It adds support for downloading all artifacts from workflow A in workflow B. By default, it only supports specifying the name of an artifact.

    Note that I was not able to use the official [actions/download-artifact](https://github.com/actions/download-artifact) because:
      - It does not support downloading artifacts across different workflows, see also https://github.com/actions/download-artifact/issues/3. Once this issue is resolved, we can switch back to [actions/download-artifact](https://github.com/actions/download-artifact).

I plan to make a pull request for both repositories so we don't have to rely on forks.

### Why are the changes needed?

Currently, it's difficult to check the failed tests. You have to scroll through long GitHub Actions logs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at: https://github.com/HyukjinKwon/spark/pull/17, https://github.com/HyukjinKwon/spark/pull/18, https://github.com/HyukjinKwon/spark/pull/19, https://github.com/HyukjinKwon/spark/pull/20, and master branch of my forked repository.

Closes #29333 from HyukjinKwon/SPARK-32357-fix.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-13 20:50:47 -07:00
HyukjinKwon 32f4ef005f [SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub Actions
### What changes were proposed in this pull request?

CRAN check fails due to the size of the generated PDF docs as below:

```
...
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
...
Status: 1 WARNING, 1 NOTE
See
  ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
for details.
```

This PR proposes to install `qpdf` in GitHub Actions.

Note that I cannot reproduce this locally with the same R version, so I am not documenting it for now.

Also, while I am here, I piggyback a change to install SparkR when the module includes `sparkr`. It is rather a follow-up of SPARK-32491.

### Why are the changes needed?

To fix SparkR CRAN check failure.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions will test it out.

Closes #29306 from HyukjinKwon/SPARK-32497.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-31 00:57:24 +09:00
HyukjinKwon e0c8bd07af [SPARK-32493][INFRA] Manually install R instead of using setup-r in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to manually install R instead of using `setup-r`, which seems broken. Currently, GitHub Actions uses its preinstalled default R 3.4.4, a version we dropped as of SPARK-32073.

While I am here, I am also upgrading R version to 4.0. Jenkins will test the old version and GitHub Actions tests the new version. AppVeyor uses R 4.0 but it does not check CRAN which is important when we make a release.

### Why are the changes needed?

To recover GitHub Actions build.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/15

Closes #29302 from HyukjinKwon/SPARK-32493.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 20:06:35 +09:00
Dongjoon Hyun 08a66f8fd0 [SPARK-32248][BUILD] Recover Java 11 build in Github Actions
### What changes were proposed in this pull request?

This PR aims to recover Java 11 build in `GitHub Action`.

### Why are the changes needed?

This test coverage was removed before. Now, it's time to recover it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #29295 from dongjoon-hyun/SPARK-32248.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-29 18:05:53 -07:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to list the APIs or classes explicitly; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed only classes instead of functions in ML and MLlib to make it relatively easier to manage.

### Why are the changes needed?

I often hear complaints from users that the current PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty messy to read compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00
HyukjinKwon 6bdd710c4d [SPARK-32316][TESTS][INFRA] Test PySpark with Python 3.8 in Github Actions
### What changes were proposed in this pull request?

This PR aims to test PySpark with Python 3.8 in Github Actions. On the script side, it is already supported:

4ad9bfd53b/python/run-tests.py (L161)

This PR includes small related fixes together:

1. Install Python 3.8
2. Only install one Python implementation instead of installing many for SQL and Yarn test cases because they need one Python executable in their test cases that is higher than Python 2.
3. Do not install Python 2 which is not needed anymore after we dropped Python 2 at SPARK-32138
4. Remove a comment about installing PyPy3 on Jenkins - SPARK-32278. It is already installed.

### Why are the changes needed?

Currently, only PyPy3 and Python 3.6 are being tested with PySpark in Github Actions. We should test the latest version of Python as well because some optimizations can be only enabled with Python 3.8+. See also https://github.com/apache/spark/pull/29114

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Was not tested. Github Actions build in this PR will test it out.

Closes #29116 from HyukjinKwon/test-python3.8-togehter.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 20:44:09 -07:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison and `__future__`. Also, it removes Python 2-specific code such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Drop support for EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy code and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test (SPARK-28358), according to my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Hyukjin Kwon 27ef3629dd [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only relevant tests just like Jenkins PR builder does. Currently, GitHub Actions always runs the full tests, which wastes resources.

In addition, this PR also fixes 3 more closely related issues while I am here.

1. The main idea here is: It reuses the existing logic embedded in `dev/run-tests.py`, which Jenkins PR builder uses in order to run only the related test cases.

2. While I am here, I fixed SPARK-32292 too to run the doc tests. They failed because other references were not available when the repository is cloned via `checkout@v2`. With `fetch-depth: 0`, the history is available.

3. In addition, it fixes `dev/run-tests.py` to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were replaced by proper options. For example,

    ```bash
    dev/run-tests.py --modules sql,core
    ```

    which is consistent with `python/run-tests.py`, for example,

    ```bash
    python/run-tests.py --modules pyspark-core,pyspark-ml
    ```

4. Lastly, also fixed the formatting issue in module specification in the matrix:

    ```diff
    -            network_common, network_shuffle, repl, launcher
    +            network-common, network-shuffle, repl, launcher,
    ```

    The old formatting caused the build/test of these modules to run incorrectly.
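
For illustration, a comma-separated `--modules` option can be parsed as in the Python sketch below. This is not the actual code of `dev/run-tests.py`; the helper `parse_modules` is hypothetical, and it shows why tolerating spaces and a trailing comma is convenient for a value like the matrix entry above:

```python
import argparse


def parse_modules(argv):
    """Parse a comma-separated --modules value into a list of module names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--modules", default="", help="comma-separated module names")
    args = parser.parse_args(argv)
    # Strip whitespace and drop empty entries, so "a, b," yields ["a", "b"].
    return [name.strip() for name in args.modules.split(",") if name.strip()]


print(parse_modules(["--modules", "network-common, network-shuffle, repl, launcher,"]))
```

Note that a misspelled name such as `network_common` would still parse; a real script would also need to validate each name against the known modules.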

### Why are the changes needed?

By running only related tests, we can save a lot of resources and avoid unrelated flaky tests, etc.
Also, it now runs the doctests of `dev/run-tests.py` properly, the usage is similar between `dev/run-tests.py` and `python/run-tests.py`, and it runs the `network-common`, `network-shuffle`, `launcher` and `examples` modules too.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark:

https://github.com/HyukjinKwon/spark/pull/7
https://github.com/HyukjinKwon/spark/pull/8
https://github.com/HyukjinKwon/spark/pull/9
https://github.com/HyukjinKwon/spark/pull/10
https://github.com/HyukjinKwon/spark/pull/11
https://github.com/HyukjinKwon/spark/pull/12

Closes #29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:31:39 -07:00
Dongjoon Hyun bc3d4bacb5 [SPARK-32245][INFRA][FOLLOWUP] Reenable Github Actions on commit
### What changes were proposed in this pull request?

This PR reenables GitHub Action on every commit as a next step.

### Why are the changes needed?

We carefully enabled GitHub Action on every PR, and it looks good so far.

As we saw at https://github.com/apache/spark/pull/29072, GitHub Action is already triggered at every commit on every PR. Enabling GitHub Action on `master` branch commits doesn't make a big difference. And we need to start testing at every commit as a next step.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #29076 from dongjoon-hyun/reenable_gha_commit.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 14:50:47 -07:00
HyukjinKwon b84ed4146d [SPARK-32245][INFRA] Run Spark tests in Github Actions
### What changes were proposed in this pull request?

This PR aims to run the Spark tests in Github Actions.

To briefly explain the main idea:

- Reuse `dev/run-tests.py` with SBT build
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module
- Pass the modules to test into `dev/run-tests.py` directly via `TEST_ONLY_MODULES` environment variable. For example, `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` _does not_ take the dependent modules into account but solely the specified modules to test.
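
The last point can be made concrete with a small Python sketch that contrasts "only the specified modules" with "the specified modules plus everything that depends on them". The dependency graph below is hypothetical and much smaller than the one in `dev/sparktestsupport/modules.py`, and both helper functions are illustrative, not the script's own:

```python
import os

# Hypothetical graph: module -> modules that directly depend on it.
DEPENDENTS = {
    "core": ["sql"],
    "sql": ["hive", "pyspark-sql"],
    "hive": [],
    "pyspark-sql": [],
}


def modules_from_env(environ=os.environ):
    """Read TEST_ONLY_MODULES and return exactly the modules it names."""
    value = environ.get("TEST_ONLY_MODULES", "")
    return [name.strip() for name in value.split(",") if name.strip()]


def with_dependents(modules):
    """Expand `modules` with everything that transitively depends on them."""
    seen = []
    stack = list(modules)
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.append(name)
            stack.extend(DEPENDENTS.get(name, []))
    return sorted(seen)


requested = modules_from_env({"TEST_ONLY_MODULES": "core"})
print(requested)                   # only what was asked for
print(with_dependents(requested))  # what a dependency-aware run would test
```

Testing only the named modules keeps each job small, which is what lets the modules be spread across separate jobs.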

Another thing to note might be the `SlowHiveTest` annotation. Running the tests in the Hive modules takes too long, so the slow tests are extracted and run as a separate job. They were identified from the actual elapsed time in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So, Hive tests are separated into two jobs. One is for the slow test cases, and the other is for the remaining test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

### Why are the changes needed?

Since last week, the Jenkins machines have been very unstable for many reasons:
  - Apparently, the machines became extremely slow. Almost all tests can't pass.
  - One machine (worker 4) started to have a corrupt `.m2`, which fails the build.
  - The documentation build fails from time to time for an unknown reason, specifically on the Jenkins machines. This is disabled for now at https://github.com/apache/spark/pull/29017.
  - Almost all PRs are basically blocked by this instability currently.

The advantages of using Github Actions:
  - To avoid depending on the few people who can access the cluster.
  - To reduce the elapsed time in the build - we could split the tests (e.g., SQL, ML, CORE), and run them in parallel so the total build time will significantly reduce.
  - To control the environment more flexibly.
  - Other contributors can test and propose to fix Github Actions configurations so we can distribute this build management cost.

Note that:
- The current build in Jenkins takes _more than 7 hours_. With Github actions it takes _less than 2 hours_
- We can now control the environments especially for Python easily.
- The test and build look more stable than the Jenkins'.

### Does this PR introduce _any_ user-facing change?

No, dev-only change.

### How was this patch tested?

Tested at https://github.com/HyukjinKwon/spark/pull/4

Closes #29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 13:09:06 -07:00
HyukjinKwon f0c79ad88a [MINOR][INFRA] Add a guide to clarify release/unreleased Spark versions of user-facing change in the Github PR template
### What changes were proposed in this pull request?

This PR proposes to add a guide to clarify the Spark version when describing "Does this PR introduce any user-facing change?".

### Why are the changes needed?

It seems confusing what to write when the user-facing changes happen only within unreleased branches.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in Github and it renders fine as intended.

Closes #28403 from HyukjinKwon/minor-more-guide.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-30 09:22:07 +09:00
Dongjoon Hyun 2d3e9601b5 [SPARK-31589][INFRA] Use r-lib/actions/setup-r in GitHub Action
### What changes were proposed in this pull request?

This PR aims to use `r-lib/actions/setup-r` because it's more stable and maintained by a third party.

### Why are the changes needed?

This will recover the current outage. In addition, this will be more robust in the future.
As of now, this is tested via https://github.com/dongjoon-hyun/spark/pull/17 .

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Actions, especially `Linter R` and `Generate Documents`.

Closes #28382 from dongjoon-hyun/SPARK-31589.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 13:22:43 +09:00
HyukjinKwon 98ec4a8ced [SPARK-31330][INFRA][FOLLOW-UP] Exclude 'ui' and 'UI.scala' in CORE and 'dev/.rat-excludes' in BUILD autolabeller
### What changes were proposed in this pull request?

This PR excludes the `ui` directory and the `UI.scala` configuration file from the `CORE` label, and excludes `dev/.rat-excludes` from the `BUILD` label in the autolabeller. See https://github.com/apache/spark/pull/28218, https://github.com/apache/spark/pull/28217, https://github.com/apache/spark/pull/28214 and https://github.com/apache/spark/pull/28213

There is some context about this at https://github.com/apache/spark/pull/28114.

The syntax is from https://git-scm.com/docs/gitignore#_pattern_format (see also https://github.com/kaelzhang/node-ignore)

### Why are the changes needed?

To label UI component properly.

### Does this PR introduce any user-facing change?

No, dev-only.

### How was this patch tested?

It uses the same syntax used for other places. I expect to see the actual results after it gets merged as it's difficult to test it out.

Closes #28228 from HyukjinKwon/SPARK-31330-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-16 10:16:58 +09:00
HyukjinKwon c519fe1358 [SPARK-31330][INFRA][FOLLOW-UP] Move sbin and some files into appropriate categories in autolabeller
### What changes were proposed in this pull request?

This PR is a follow-up to 1b87015044. PRs are now labelled automatically, and this seems to be working fine.

This PR proposes some minor corrections to the path lists and categories.

**1.** Move `sbin` from `CORE` into `DEPLOY` components.

```
$ ls sbin

decommission-slave.sh          start-all.sh                   start-slave.sh                 stop-master.sh                 stop-thriftserver.sh
slaves.sh                      start-history-server.sh        start-slaves.sh                stop-mesos-dispatcher.sh
spark-config.sh                start-master.sh                start-thriftserver.sh          stop-mesos-shuffle-service.sh
spark-daemon.sh                start-mesos-dispatcher.sh      stop-all.sh                    stop-slave.sh
spark-daemons.sh               start-mesos-shuffle-service.sh stop-history-server.sh         stop-slaves.sh
```

**2.**

`/sbin/*mesos*.sh` -> `MESOS`
`/bin/spark-shell*` -> `SPARK SHELL`.
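
The mapping above can be sketched as a small path-to-label lookup. This is only an illustration under stated assumptions: the `RULES` table and the `labels_for` helper are hypothetical, the `/sbin/*` pattern for `DEPLOY` is a guess at how "move `sbin` into `DEPLOY`" might be written, and Python's `fnmatch` stands in for the labeller's real matcher.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table. The MESOS and SPARK SHELL patterns are quoted from
# the commit message; the DEPLOY pattern is an assumption for this example.
RULES = {
    "DEPLOY": ["/sbin/*"],
    "MESOS": ["/sbin/*mesos*.sh"],
    "SPARK SHELL": ["/bin/spark-shell*"],
}


def labels_for(path):
    """Return the sorted labels whose patterns match a repository-relative path."""
    rooted = "/" + path.lstrip("/")  # patterns are anchored at the repository root
    return sorted(
        label
        for label, patterns in RULES.items()
        if any(fnmatchcase(rooted, pattern) for pattern in patterns)
    )


print(labels_for("sbin/start-mesos-dispatcher.sh"))
print(labels_for("bin/spark-shell.cmd"))
print(labels_for("sbin/start-all.sh"))
```

In this sketch a Mesos script under `sbin` receives both `DEPLOY` and `MESOS`; whether the real labeller applies both labels depends on its configuration.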

### Why are the changes needed?

To label PRs correctly, so that developers can take advantage of the labels, for example to check the PRs of a specific component.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It has not been tested yet. It can be tested after it is merged.

Closes #28201 from HyukjinKwon/SPARK-31330.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-13 18:48:41 +09:00
Nicholas Chammas 1b87015044 [SPARK-31330] Automatically label PRs based on the paths they touch
### What changes were proposed in this pull request?

This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.

### Why are the changes needed?

This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

We'll only be able to test it, I believe, after merging it in. Given that [the Avro project is using this same bot already](https://github.com/apache/avro/blob/master/.github/autolabeler.yml), I expect it will be straightforward to get this working.

Closes #28114 from nchammas/SPARK-31330-auto-label-prs.

Lead-authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-13 10:01:31 +09:00
Dongjoon Hyun 2b744fe885 [SPARK-30963][INFRA] Add GitHub Action job for document generation
### What changes were proposed in this pull request?

This PR aims to add a new `GitHub Action` job for document generation.

### Why are the changes needed?

We should test document generation in the PR builder.
- https://lists.apache.org/thread.html/rd06a2154e853812652b8f7fa3c003746ed531b213c531517f055e1dc%40%3Cdev.spark.apache.org%3E

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action in this PR.

Closes #27715 from dongjoon-hyun/SPARK-30963.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-26 19:24:41 -08:00