Commit graph

813 commits

Yin Huai ea626b6acf [SPARK-30783] Exclude hive-service-rpc
### What changes were proposed in this pull request?
Exclude hive-service-rpc from build.

### Why are the changes needed?
hive-service-rpc 2.3.6 and Spark SQL's thrift server module have duplicate classes. Leaving hive-service-rpc 2.3.6 on the class path means that Spark can pick up classes defined in Hive instead of in its own thrift server module, which can cause hard-to-debug runtime errors due to class loading order, as well as compilation errors for applications that depend on Spark.

If you compare hive-service-rpc 2.3.6's jar (https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar) and Spark thrift server's jar (e.g. https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar), you will see that all of the classes provided by hive-service-rpc-2.3.6.jar are covered by Spark thrift server's jar. https://issues.apache.org/jira/browse/SPARK-30783 has the output of `jar tf` for both jars. A quick way to verify the overlap is sketched below.
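
A minimal sketch (not part of this PR; the jar paths are placeholders for downloaded copies of the two jars above) of how to check that overlap: list the `.class` entries in each jar and diff the sets.

```python
import zipfile

def class_entries(jar_path):
    # Jars are zip archives; collect every .class entry they contain.
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}

rpc = class_entries("hive-service-rpc-2.3.6.jar")
thrift = class_entries("spark-hive-thriftserver_2.12-3.0.0-SNAPSHOT.jar")
# Empty output means the thrift server jar covers every hive-service-rpc class.
print(sorted(rpc - thrift))
```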

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #27533 from yhuai/SPARK-30783.

Authored-by: Yin Huai <yhuai@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 00:12:45 +08:00
HyukjinKwon 20c60a43cc [MINOR][INFRA] Factor Python executable out as a variable in 'lint-python' script
### What changes were proposed in this pull request?

This PR proposes to turn the hardcoded `python3` into a variable, `PYTHON_EXECUTABLE`, in the `lint-python` script.

### Why are the changes needed?

To make changes easier. See 561e9b9688 as an example.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually by running `dev/lint-python`.

Closes #27470 from HyukjinKwon/minor-PYTHON_EXECUTABLE.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 17:01:33 -08:00
Onur Satici 86fdb818bf [SPARK-30715][K8S] Bump fabric8 to 4.7.1
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1

### Why are the changes needed?
The new fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit and integration tests cover the creation of K8s objects. Adjusted them to work with the new fabric8 version.

Closes #27443 from onursatici/os/bump-fabric8.

Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 01:17:30 -08:00
Dongjoon Hyun 1adf3520e3 [SPARK-30704][INFRA] Use jekyll-redirect-from 0.15.0 instead of the latest
### What changes were proposed in this pull request?

This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5.

### Why are the changes needed?

`jekyll-redirect-from` released 0.16.0 a few days ago, and that version requires Ruby >= 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
```
$ cd dev/create-release/spark-rm/
$ docker build -t spark:test .
...
ERROR:  Error installing jekyll-redirect-from:
	jekyll-redirect-from requires Ruby version >= 2.4.0.
...
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually do the above command to build `spark-rm` Docker image.
```
...
Successfully installed jekyll-redirect-from-0.15.0
Parsing documentation for jekyll-redirect-from-0.15.0
Installing ri documentation for jekyll-redirect-from-0.15.0
Done installing documentation for jekyll-redirect-from after 0 seconds
1 gem installed
Successfully installed rouge-3.15.0
Parsing documentation for rouge-3.15.0
Installing ri documentation for rouge-3.15.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container e0ec7c77b69f
 ---> 32dec37291c6
```

Closes #27434 from dongjoon-hyun/SPARK-30704.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-02 00:44:25 -08:00
Dongjoon Hyun 2fd15a26fb [SPARK-30695][BUILD] Upgrade Apache ORC to 1.5.9
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache ORC 1.5.9.
- For the `hive-2.3` profile, we need to upgrade `hive-storage-api` from `2.6.0` to `2.7.1`.
- For the `hive-1.2` profile, the ORC library with the `nohive` classifier already shades it, so there is no change.

### Why are the changes needed?

This will bring the latest bug fixes. The following is the full release note.
- https://issues.apache.org/jira/projects/ORC/versions/12346546

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Here is the summary.
1. `Hive 1.2 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580924552))
2. `Hive 2.3 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580973391))

Closes #27421 from dongjoon-hyun/SPARK-ORC-1.5.9.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-31 17:41:27 -08:00
Dongjoon Hyun 561e9b9688 [SPARK-30674][INFRA] Use python3 in dev/lint-python
### What changes were proposed in this pull request?

This PR aims to use `python3` instead of `python` in `dev/lint-python`.

### Why are the changes needed?

Currently, `dev/lint-python` fails under Python 2, and Python 2 has been EOL since January 1, 2020.
```
$ python -V
Python 2.7.17

$ dev/lint-python
starting python compilation test...
Python compilation failed with the following errors:
Compiling ./python/setup.py ...
  File "./python/setup.py", line 27
    file=sys.stderr)
        ^
SyntaxError: invalid syntax
```

### Does this PR introduce any user-facing change?

No. This is a dev environment.

### How was this patch tested?

Jenkins is running this with Python 3 already.
The following is a manual test.

```
$ python -V
Python 3.8.0

$ dev/lint-python
starting python compilation test...
python compilation succeeded.
```

Closes #27394 from dongjoon-hyun/SPARK-30674.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 03:17:29 -08:00
Nicholas Chammas bda0669110 [SPARK-30665][DOCS][BUILD][PYTHON] Eliminate pypandoc dependency
### What changes were proposed in this pull request?

This PR removes any dependencies on pypandoc. It also makes related tweaks to the docs README to clarify the dependency on pandoc (not pypandoc).

### Why are the changes needed?

We are using pypandoc to convert the Spark README from Markdown to ReST for PyPI. PyPI now natively supports Markdown, so we don't need pypandoc anymore. The dependency on pypandoc also sometimes causes issues when installing Python packages that depend on PySpark, as described in #18981.
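
For illustration, a hedged sketch (not Spark's actual `setup.py`) of the pattern this change relies on: pass the Markdown README through unchanged and declare its content type, since PyPI renders Markdown natively.

```python
from setuptools import setup

with open("README.md") as f:
    long_description = f.read()

setup(
    name="pyspark",                     # illustrative metadata only
    version="3.0.0.dev0",
    long_description=long_description,  # Markdown as-is; no pandoc conversion
    long_description_content_type="text/markdown",
)
```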

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually:

```sh
python -m venv venv
source venv/bin/activate
pip install -U pip

cd python/
python setup.py sdist
pip install dist/pyspark-3.0.0.dev0.tar.gz
pyspark --version
```

I also built the PySpark and R API docs with `jekyll` and reviewed them locally.

It would be good if a maintainer could also test this by creating a PySpark distribution and uploading it to [Test PyPI](https://test.pypi.org) to confirm the README looks as it should.

Closes #27376 from nchammas/SPARK-30665-pypandoc.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-30 16:40:38 +09:00
Dongjoon Hyun 862959747e [SPARK-30639][BUILD] Upgrade Jersey to 2.30
### What changes were proposed in this pull request?

For better JDK11 support, this PR aims to upgrade **Jersey** and **javassist** to `2.30` and `3.25.0-GA` respectively.

### Why are the changes needed?

**Jersey**: This will bring the following `Jersey` updates.
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.30.html
  - https://github.com/eclipse-ee4j/jersey/issues/4245 (Java 11 java.desktop module dependency)

**javassist**: This is a transitive dependency, upgraded from 3.20.0-CR2 to 3.25.0-GA.
- `javassist` officially supports JDK11 from [3.24.0-GA release note](https://github.com/jboss-javassist/javassist/blob/master/Readme.html#L308).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with both JDK8 and JDK11.

Closes #27357 from dongjoon-hyun/SPARK-30639.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 15:41:55 -08:00
cody koeninger 843224ebd4 [SPARK-30570][BUILD] Update scalafmt plugin to 1.0.3 with onlyChangedFiles feature
### What changes were proposed in this pull request?
Update the scalafmt plugin to 1.0.3 and use its new onlyChangedFiles feature rather than --diff

### Why are the changes needed?
Older versions of the plugin either didn't work with Scala 2.13, or dropped the `--diff` argument and didn't allow formatting only changed files.

### Does this PR introduce any user-facing change?
The `dev/scalafmt` script no longer passes through arbitrary args; instead it uses its argument to select the Scala version. The issue here is that the plugin name literally contains the Scala version, and there doesn't appear to be a shorter way to refer to it. If srowen or someone else with better maven-fu has an idea, I'm all ears.

### How was this patch tested?
Manually; e.g., edited a file and ran `dev/scalafmt` or `dev/scalafmt 2.13`.

Closes #27279 from koeninger/SPARK-30570.

Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-23 12:44:43 -08:00
HyukjinKwon ab0890bdb1 [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
### What changes were proposed in this pull request?

This PR proposes to redesign pandas UDFs as described in [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).

```python
from pyspark.sql.functions import pandas_udf
import pandas as pd

pandas_udf("long")
def plug_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plug_one("id")).show()
```

```
+------------+
|plug_one(id)|
+------------+
|           1|
|           2|
|           3|
|           4|
|           5|
|           6|
|           7|
|           8|
|           9|
|          10|
+------------+
```

Note that this PR also addresses one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by zero323).

In short,

- Adds type hints as an alternative, experimental way to declare pandas UDFs.
    ```python
    @pandas_udf(schema='...')
    def func(c1: Series, c2: Series) -> DataFrame:
        pass
    ```

- Replaces and/or adds an alias for the three UDF types below, making them separate standalone APIs. So, `pandas_udf` is now consistent with regular `udf`s and other expressions.

    `df.mapInPandas(udf)`  -replace-> `df.mapInPandas(f, schema)`
    `df.groupby.apply(udf)`  -alias-> `df.groupby.applyInPandas(f, schema)`
    `df.groupby.cogroup.apply(udf)`  -replace-> `df.groupby.cogroup.applyInPandas(f, schema)`

    *`df.groupby.apply` was added in 2.3, while the others were added in master only.

- No deprecation for the existing ways for now.
    ```python
    @pandas_udf(schema='...', functionType=PandasUDFType.SCALAR)
    def func(c1, c2):
        pass
    ```
If users are happy with this, I plan to deprecate the existing way and declare that using type hints is no longer experimental.

One design goal in this PR was to avoid touching the internals (since we didn't deprecate the old ways for now) while supporting type hints with minimal changes at the interface only.

- Once we deprecate or remove the old ways, I think another refactoring of the internals will be required in the future. At the very least, we should rename the internal pandas evaluation types.
- If users find these experimental type hints aren't quite helpful, we should simply revert the changes at the interface level.

### Why are the changes needed?

In order to address old design issues. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).

### Does this PR introduce any user-facing change?

For behaviour changes, No.

It adds new ways to use pandas UDFs by using type hints. See below.

**SCALAR**:

```python
@pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> Series:
    pass  # DataFrame represents a struct column
```

**SCALAR_ITER**:

```python
@pandas_udf(schema='...')
def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]:
    pass  # Same as SCALAR but wrapped by Iterator
```

**GROUPED_AGG**:

```python
@pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> int:
    pass  # DataFrame represents a struct column
```

**GROUPED_MAP**:

This was added in Spark 2.3 as of SPARK-20396. As described above, it keeps the existing behaviour. Additionally, we now have a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below:

```python
def func(pdf):
    return pdf

df.groupby("...").applyInPandas(func, schema=df.schema)
```

**MAP_ITER**: this is not a pandas UDF anymore

This was added in Spark 3.0 as of SPARK-28198; and this PR replaces the usages. See the example below:

```python
def func(iter):
    for df in iter:
        yield df

df.mapInPandas(func, df.schema)
```

**COGROUPED_MAP**: this is not a pandas UDF anymore

This was added in Spark 3.0 as of SPARK-27463; and this PR replaces the usages. See the example below:

```python
import pandas as pd

def asof_join(left, right):
    return pd.merge_asof(left, right, on="...", by="...")

df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...")
```

### How was this patch tested?

Unit tests added and tested against Python 2.7, 3.6 and 3.7.

Closes #27165 from HyukjinKwon/revisit-pandas.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-22 15:32:58 +09:00
HyukjinKwon e170422f74 Revert "[SPARK-30534][INFRA] Use mvn in dev/scalastyle"
This reverts commit 384899944b.
2020-01-21 18:23:03 +09:00
Takeshi Yamamuro 775fae4640 [SPARK-30486][BUILD] Bump lz4-java version to 1.7.1
### What changes were proposed in this pull request?

This PR intends to upgrade lz4-java from 1.7.0 to 1.7.1.

### Why are the changes needed?

This release includes a bug fix for older macOS versions. You can see the link below for the changes:
https://github.com/lz4/lz4-java/blob/master/CHANGES.md#171

### Does this PR introduce any user-facing change?

### How was this patch tested?

Existing tests.

Closes #27271 from maropu/SPARK-30486.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-19 19:05:30 -08:00
Sean Owen a2081ae4e1 [SPARK-29290][CORE] Update to chill 0.9.5
### What changes were proposed in this pull request?

Update Twitter Chill to 0.9.5.

### Why are the changes needed?

Primarily, to prepare for Scala 2.13 support later.
Other changes from 0.9.3 are apparently just minor fixes and improvements:
https://github.com/twitter/chill/releases

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #27227 from srowen/SPARK-29290.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-19 18:39:38 -08:00
Dongjoon Hyun 384899944b [SPARK-30534][INFRA] Use mvn in dev/scalastyle
### What changes were proposed in this pull request?

This PR aims to use `mvn` instead of `sbt` in `dev/scalastyle` to recover GitHub Action.

### Why are the changes needed?

As of now, Apache Spark sbt build is broken by the Maven Central repository policy.
https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an

> Effective January 15, 2020, The Central Maven Repository no longer supports insecure
> communication over plain HTTP and requires that all requests to the repository are
> encrypted over HTTPS.

We can reproduce this locally by the following.
```
$ rm -rf ~/.m2/repository/org/apache/apache/18/
$ build/sbt clean
```

And, in GitHub Actions, `lint-scala` is the only job that uses `sbt`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

First of all, GitHub Action should be recovered.

Also, manually, do the following.

**Without Scalastyle violation**
```
$ dev/scalastyle
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
Using `mvn` from path: /usr/local/bin/mvn
Scalastyle checks passed.
```

**With Scalastyle violation**
```
$ dev/scalastyle
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
Using `mvn` from path: /usr/local/bin/mvn
Scalastyle checks failed at following occurrences:
error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/main/scala/org/apache/spark/SparkConf.scala message=There should be no empty line separating imports in the same group. line=22 column=0
error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/test/scala/org/apache/spark/resource/ResourceProfileSuite.scala message=There should be no empty line separating imports in the same group. line=22 column=0
```

Closes #27242 from dongjoon-hyun/SPARK-30534.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-16 16:00:58 -08:00
Xinrong Meng f88874194a [SPARK-30491][INFRA] Enable dependency audit files to tell dependency classifier
### What changes were proposed in this pull request?
Enable dependency audit files to convey the artifact id, version, and classifier of each dependency.

For example, `avro-mapred-1.8.2-hadoop2.jar` should be expanded to `avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar`, where `avro-mapred` is the artifact id, `1.8.2` is the version, and `hadoop2` is the classifier.

### Why are the changes needed?
Dependency audit files are expected to be consumed by automated tests or downstream tools.

However, the current dependency audit files under `dev/deps` only show jar names, and there isn't a simple rule for parsing a jar name to recover the different fields. For example, `hadoop2` is the classifier in `avro-mapred-1.8.2-hadoop2.jar`; in contrast, `incubating` is part of the version in `htrace-core-3.1.0-incubating.jar`. The expanded layout removes this ambiguity, as sketched below.
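
As a sketch (a hypothetical helper, not the PR's actual tooling), the expanded `artifactId/version/classifier/jar` layout makes the fields trivially recoverable:

```python
def parse_audit_entry(entry: str):
    # Entry format: <artifactId>/<version>[/<classifier>]/<jar file name>
    parts = entry.split("/")
    artifact_id, version = parts[0], parts[1]
    classifier = parts[2] if len(parts) == 4 else None
    return artifact_id, version, classifier, parts[-1]

print(parse_audit_entry("avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar"))
# -> ('avro-mapred', '1.8.2', 'hadoop2', 'avro-mapred-1.8.2-hadoop2.jar')
```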

Reference: There is a good example of the downstream tool that would be enabled as yhuai suggested,

> Say we have a Spark application that depends on a third-party dependency `foo`, which pulls in `jackson` as a transitive dependency. Unfortunately, `foo` depends on a different version of `jackson` than Spark. So, in the pom of this Spark application, we use the dependency management section to pin the version of `jackson`. By doing this, we are lifting `jackson` to a top-level dependency of my application, and I want to have a way to keep tracking what Spark uses. What we can do is to cross-check my Spark application's classpath with what Spark uses. Then, with a test written in my code base, whenever my application bumps the Spark version, this test will check what we define in the application against what Spark has, and then remind us to change our application's pom if needed. In my case, I am fine with directly accessing git to get these audit files.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Code changes are verified by generated dependency audit files naturally. Thus, there are no tests added.

Closes #27177 from mengCareers/depsOptimize.

Lead-authored-by: Xinrong Meng <meng.careers@gmail.com>
Co-authored-by: mengCareers <meng.careers@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-15 20:19:44 -08:00
HyukjinKwon dcdc9a8be7 [SPARK-28198][PYTHON][FOLLOW-UP] Run the tests of MAP ITER UDF in Jenkins
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/24997 missed to add `pyspark.sql.tests.test_pandas_udf_iter` to `modules.py`. This PR adds it.

### Why are the changes needed?

Currently, Jenkins does not run the test cases. We should run them.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins should test.

Closes #27141 from HyukjinKwon/SPARK-28198-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 13:45:50 +09:00
Eric Chang 5c71304b43 [SPARK-30450][INFRA][FOLLOWUP] Fix git folder regex for windows file separator
### What changes were proposed in this pull request?

The regex is meant to exclude the .git folder for the python linter, but bash escaping caused only a single backslash to be included. This adds the necessary second backslash.

### Why are the changes needed?

This is necessary to properly match the file separator character.
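
As an illustration (a hypothetical pattern, not the script's exact regex), a character class can match either separator; note that the backslash must be doubled inside the regex:

```python
import re

# Match a .git path component on either POSIX or Windows-style paths.
GIT_DIR = re.compile(r"[/\\]\.git[/\\]")

print(bool(GIT_DIR.search("./.git/logs/refs/heads/3.py")))   # True
print(bool(GIT_DIR.search(".\\.git\\refs\\heads\\3.py")))    # True
```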

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually.
Added file dev/something.git.py and ran `dev/lint-python`:
```
$ dev/lint-python
pycodestyle checks failed.
*** Error compiling './dev/something.git.py'...
  File "./dev/something.git.py", line 1
    mport asdf2
              ^
SyntaxError: invalid syntax
```

Closes #27140 from ericfchang/master.

Authored-by: Eric Chang <eric.chang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-08 19:38:20 -08:00
HyukjinKwon ee8d661058 [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?

This PR proposes to move pandas related functionalities into pandas package. Namely:

```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py  # Conversion between pandas <> PySpark DataFrames
├── functions.py   # pandas_udf
├── group_ops.py   # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py     # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py       # Type utils between pandas <> PyArrow
└── utils.py       # Version requirement checks
```

In order to locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` separately under the `pandas` sub-package, I had to use a mix-in approach, which the Scala side often uses via `trait`s and which pandas itself also uses (see `IndexOpsMixin` as an example) to group related functionality. You can think of it as something like Scala's self-typed trait. See the structure below:

```python
class PandasMapOpsMixin(object):
    def mapInPandas(self, ...):
        ...
        return ...

    # other Pandas <> PySpark APIs
```

```python
class DataFrame(PandasMapOpsMixin):

    # other DataFrame APIs equivalent to Scala side.

```
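
To make the pattern concrete, here is a self-contained toy version (entirely hypothetical names and behavior, not Spark code) of how a mixin groups the pandas-specific methods that the main class then inherits:

```python
class PandasMapOpsMixin:
    """Groups pandas-specific operations; assumes `self` is a DataFrame."""
    def map_in_pandas(self, func):
        # In Spark this would apply `func` per pandas batch; here we just report it.
        return f"mapInPandas({func.__name__}) on {self.name}"

class DataFrame(PandasMapOpsMixin):
    def __init__(self, name):
        self.name = name

def identity(batch):
    return batch

print(DataFrame("df").map_in_pandas(identity))  # mapInPandas(identity) on df
```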

Yes, this is a big PR, but the changes are mostly just moving code around, except for one case, `createDataFrame`, where I had to split the methods.

### Why are the changes needed?

pandas functionality is scattered here and there, and I myself get lost trying to find it. Also, when you have to make a change that applies commonly to all pandas-related features, it's almost impossible now.

Also, after this change, `DataFrame` and `SparkSession` become more consistent with the Scala side, since pandas is specific to Python; this change separates pandas-specific APIs away from `DataFrame` and `SparkSession`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover this. Also, I manually built the PySpark API documentation and checked it.

Closes #27109 from HyukjinKwon/pandas-refactoring.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 10:22:50 +09:00
HyukjinKwon 390e6bd7bc [SPARK-30453][BUILD][R] Update AppVeyor R version to 3.6.2
### What changes were proposed in this pull request?
R version 3.6.2 (Dark and Stormy Night) was released on 2019-12-12. This PR aims to upgrade the R installation for the AppVeyor CI environment.

### Why are the changes needed?
To test the latest R versions before the release, and see if there are any regressions.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
AppVeyor will test.

Closes #27124 from HyukjinKwon/upgrade-r-version-appveyor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-07 18:43:21 -08:00
Eric Chang ed8a260749 [SPARK-30450][INFRA] Exclude .git folder for python linter
### What changes were proposed in this pull request?

This excludes the .git folder when the python linter runs. We want to exclude it because .git may contain files from other branches that could cause the linter to fail.

### Why are the changes needed?

I ran into a case where a branch name ended with a ".py" suffix, so there were git ref files in the .git folder, under .git/logs/refs and .git/refs/remotes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual.
```
$ git branch 3.py
$ git checkout 3.py
Switched to branch '3.py'
$ dev/lint-python
starting python compilation test...
Python compilation failed with the following errors:
*** Error compiling './.git/logs/refs/heads/3.py'...
  File "./.git/logs/refs/heads/3.py", line 1
    0000000000000000000000000000000000000000 895e572b73 Dongjoon Hyun <dhyun@apple.com> 1578438255 -0800	branch: Created from master
                                                   ^
SyntaxError: invalid syntax

*** Error compiling './.git/refs/heads/3.py'...
  File "./.git/refs/heads/3.py", line 1
    895e572b73
                                           ^
SyntaxError: invalid syntax
```

Closes #27120 from ericfchang/master.

Authored-by: Eric Chang <eric.chang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-07 15:14:17 -08:00
WeichenXu 88542bc3d9 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
### What changes were proposed in this pull request?

PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df.select(vector_to_array(col("features")))
```
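
A runnable end-to-end sketch (assuming an active `SparkSession` named `spark`; the data is illustrative):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.sparse(2, [0], [3.0]),)],
    ["features"],
)
# Sparse vectors come back as dense arrays, e.g. [3.0, 0.0].
df.select(vector_to_array(col("features")).alias("array")).show()
```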

### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do it in the JVM. However, that would require the PySpark user to write Scala code and register it as a UDF, which is often infeasible for a pure Python project.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.

Closes #26910 from WeichenXu123/vector_to_array.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-01-06 16:18:51 -08:00
Sean Owen fac6b9bde8 Revert [SPARK-27300][GRAPH] Add Spark Graph modules and dependencies
This reverts commit 709387d660.

See https://issues.apache.org/jira/browse/SPARK-27300?focusedCommentId=16990048&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16990048 and previous mailing list discussions.

### What changes were proposed in this pull request?

Revert the addition of skeleton graph API modules for Spark 3.0.

### Why are the changes needed?

It does not appear that content will be added to the modules for Spark 3, so I propose not committing to the modules, which are no-ops now, in the upcoming major 3.0 release.

### Does this PR introduce any user-facing change?

No, the modules were not released.

### How was this patch tested?

Existing tests, but mostly N/A.

Closes #26928 from srowen/Revert27300.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-17 09:06:23 -08:00
Yuming Wang 5de5e46624 [SPARK-30268][INFRA] Fix incorrect pyspark version when releasing preview versions
### What changes were proposed in this pull request?

This PR fixes the incorrect PySpark version when releasing preview versions.

### Why are the changes needed?

Failed to make Spark binary distribution:
```
cp: cannot stat 'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: signing failed: No such file or directory
gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
```

```
yumwang@ubuntu-3513086:~/spark-release/output$ ll spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
total 214140
drwxr-xr-x 2 yumwang stack      4096 Dec 16 06:17 ./
drwxr-xr-x 9 yumwang stack      4096 Dec 16 06:17 ../
-rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
```

```
/usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: Normalizing '3.0.0.dev02' to '3.0.0.dev2'
  normalized_version,
```
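
For reference, a sketch of the PEP 440 normalization behind that warning (using the third-party `packaging` library, which is not part of the release scripts): setuptools normalizes `3.0.0.dev02` to `3.0.0.dev2`, so the script must expect the normalized filename.

```python
from packaging.version import Version

print(Version("3.0.0.dev02"))  # 3.0.0.dev2 -- the leading zero is dropped
```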

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
```
LM-SHC-16502798:spark yumwang$ SPARK_VERSION=3.0.0-preview2
LM-SHC-16502798:spark yumwang$ echo "$SPARK_VERSION" |  sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"
3.0.0.dev2

```

Closes #26909 from wangyum/SPARK-30268.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-17 10:22:29 +09:00
Yuming Wang ba0f59bfaf [SPARK-30265][INFRA] Do not change R version when releasing preview versions
### What changes were proposed in this pull request?
This PR makes the release scripts leave the R version unchanged when releasing preview versions.

### Why are the changes needed?
Failed to make Spark binary distribution:
```
++ . /opt/spark-rm/output/spark-3.0.0-preview2-bin-hadoop2.7/R/find-r.sh
+++ '[' -z /usr/bin ']'
++ /usr/bin/Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
Loading required package: usethis
Updating SparkR documentation
First time using roxygen2. Upgrading automatically...
Loading SparkR
Invalid DESCRIPTION:
Malformed package version.

See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.

Error: invalid version specification '3.0.0-preview2'
In addition: Warning message:
roxygen2 requires Encoding: UTF-8
Execution halted
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:804)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:751)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:313)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-preview2:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 18.619 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 13.652 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  5.673 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  2.081 s]
[INFO] Spark Project Networking ........................... SUCCESS [  3.509 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  0.993 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  7.556 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  5.522 s]
[INFO] Spark Project Core ................................. FAILURE [01:06 min]
[INFO] Spark Project ML Local Library ..................... SKIPPED
[INFO] Spark Project GraphX ............................... SKIPPED
[INFO] Spark Project Streaming ............................ SKIPPED
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SKIPPED
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project Graph API ............................ SKIPPED
[INFO] Spark Project Cypher ............................... SKIPPED
[INFO] Spark Project Graph ................................ SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SKIPPED
[INFO] Spark Integration for Kafka 0.10 ................... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SKIPPED
[INFO] Spark Avro ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:04 min
[INFO] Finished at: 2019-12-16T08:02:45Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (sparkr-pkg) on project spark-core_2.12: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :spark-core_2.12
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
```diff
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index cdb59093781..b648c51e010 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
 -336,8 +336,8  sparkR.session <- function(

   # Check if version number of SparkSession matches version number of SparkR package
   jvmVersion <- callJMethod(sparkSession, "version")
-  # Remove -SNAPSHOT from jvm versions
-  jvmVersionStrip <- gsub("-SNAPSHOT", "", jvmVersion)
+  # Remove -preview2 from jvm versions
+  jvmVersionStrip <- gsub("-preview2", "", jvmVersion)
   rPackageVersion <- paste0(packageVersion("SparkR"))

   if (jvmVersionStrip != rPackageVersion) {

```

Closes #26904 from wangyum/SPARK-30265.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 04:54:12 -07:00
Yuming Wang 1fc353d51a Revert "[SPARK-30056][INFRA] Skip building test artifacts in dev/make-distribution.sh"
### What changes were proposed in this pull request?

This reverts commit 7c0ce285.

### Why are the changes needed?

Failed to make distribution:
```
[INFO] -----------------< org.apache.spark:spark-sketch_2.12 >-----------------
[INFO] Building Spark Project Sketch 3.0.0-preview2                      [3/33]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] Downloading from central: https://repo.maven.apache.org/maven2/org/apache/spark/spark-tags_2.12/3.0.0-preview2/spark-tags_2.12-3.0.0-preview2-tests.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-preview2:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 26.513 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 48.393 s]
[INFO] Spark Project Sketch ............................... FAILURE [  0.034 s]
[INFO] Spark Project Local DB ............................. SKIPPED
[INFO] Spark Project Networking ........................... SKIPPED
[INFO] Spark Project Shuffle Streaming Service ............ SKIPPED
[INFO] Spark Project Unsafe ............................... SKIPPED
[INFO] Spark Project Launcher ............................. SKIPPED
[INFO] Spark Project Core ................................. SKIPPED
[INFO] Spark Project ML Local Library ..................... SKIPPED
[INFO] Spark Project GraphX ............................... SKIPPED
[INFO] Spark Project Streaming ............................ SKIPPED
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SKIPPED
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project Graph API ............................ SKIPPED
[INFO] Spark Project Cypher ............................... SKIPPED
[INFO] Spark Project Graph ................................ SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project YARN Shuffle Service ................. SKIPPED
[INFO] Spark Project YARN ................................. SKIPPED
[INFO] Spark Project Mesos ................................ SKIPPED
[INFO] Spark Project Kubernetes ........................... SKIPPED
[INFO] Spark Project Hive Thrift Server ................... SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SKIPPED
[INFO] Spark Integration for Kafka 0.10 ................... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SKIPPED
[INFO] Spark Avro ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:15 min
[INFO] Finished at: 2019-12-16T05:29:43Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project spark-sketch_2.12: Could not resolve dependencies for project org.apache.spark:spark-sketch_2.12:jar:3.0.0-preview2: Could not find artifact org.apache.spark:spark-tags_2.12:jar:tests:3.0.0-preview2 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

manual test.

Closes #26902 from wangyum/SPARK-30056.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-15 23:16:17 -07:00
Yuming Wang 26b658f6fb [SPARK-30253][INFRA] Do not add commits when releasing preview version
### What changes were proposed in this pull request?

This PR adds support for not adding commits to the master branch when releasing a preview version.

### Why are the changes needed?

Otherwise we need to manually revert this change, for example:
![image](https://user-images.githubusercontent.com/5399861/70788945-f9d15180-1dcc-11ea-81f5-c0d89c28440a.png)

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

manual test

Closes #26879 from wangyum/SPARK-30253.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-15 19:44:29 -07:00
Yuming Wang e1ee3fb72f [SPARK-30216][INFRA] Use python3 in Docker release image
### What changes were proposed in this pull request?

- Reverts commits 1f94bf4 and d6be46e
- Switches python to python3 in Docker release image.

### Why are the changes needed?
`dev/make-distribution.sh` and `python/setup.py` use python3.
https://github.com/apache/spark/pull/26844/files#diff-ba2c046d92a1d2b5b417788bfb5cb5f8L236
https://github.com/apache/spark/pull/26330/files#diff-8cf6167d58ce775a08acafcfe6f40966

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

manual test:
```
yumwang@ubuntu-3513086:~/spark$ dev/create-release/do-release-docker.sh -n -d /home/yumwang/spark-release
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]: 3.0.0-preview2
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [master]:
ASF user [yumwang]:
Full name [Yuming Wang]:
GPG key [yumwang@apache.org]: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
================
Release details:
BRANCH:     master
VERSION:    3.0.0-preview2
TAG:        v3.0.0-preview2-rc1
NEXT:       3.0.1-SNAPSHOT

ASF USER:   yumwang
GPG KEY:    DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
FULL NAME:  Yuming Wang
E-MAIL:     yumwang@apache.org
================
Is this info correct [y/n]? y
GPG passphrase:

========================
= Building spark-rm image with tag latest...
Command: docker build -t spark-rm:latest --build-arg UID=110302528 /home/yumwang/spark/dev/create-release/spark-rm
Log file: docker-build.log
Building v3.0.0-preview2-rc1; output will be at /home/yumwang/spark-release/output

gpg: directory '/home/spark-rm/.gnupg' created
gpg: keybox '/home/spark-rm/.gnupg/pubring.kbx' created
gpg: /home/spark-rm/.gnupg/trustdb.gpg: trustdb created
gpg: key 6E1B4122F6A3A338: public key "Yuming Wang <yumwang@apache.org>" imported
gpg: key 6E1B4122F6A3A338: secret key imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg:       secret keys read: 1
gpg:   secret keys imported: 1
========================
= Creating release tag v3.0.0-preview2-rc1...
Command: /opt/spark-rm/release-tag.sh
Log file: tag.log
It may take some time for the tag to be synchronized to github.
Press enter when you've verified that the new tag (v3.0.0-preview2-rc1) is available.
========================
= Building Spark...
Command: /opt/spark-rm/release-build.sh package
Log file: build.log
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
========================
= Publishing release
Command: /opt/spark-rm/release-build.sh publish-release
Log file: publish.log
```
Generated doc:
![image](https://user-images.githubusercontent.com/5399861/70693075-a7723100-1cf7-11ea-9f88-9356a02349a1.png)

Closes #26848 from wangyum/SPARK-30216.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-13 11:31:31 -08:00
Dongjoon Hyun cc276f8a6e [SPARK-30243][BUILD][K8S] Upgrade K8s client dependency to 4.6.4
### What changes were proposed in this pull request?

This PR aims to upgrade K8s client library from 4.6.1 to 4.6.4 for `3.0.0-preview2`.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.6.4
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.6.3
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.6.2

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with K8s integration test.

Closes #26874 from dongjoon-hyun/SPARK-30243.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-13 08:25:51 -08:00
Yuming Wang 39c0696a39 [MINOR] Fix google style guide address
### What changes were proposed in this pull request?

This PR updates the Google style guide address to `https://google.github.io/styleguide/javaguide.html`.

### Why are the changes needed?

`https://google-styleguide.googlecode.com/svn-history/r130/trunk/javaguide.html` **404**:

![image](https://user-images.githubusercontent.com/5399861/70717915-431c9500-1d2a-11ea-895b-024be953a116.png)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Closes #26865 from wangyum/fix-google-styleguide.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-12 11:04:01 -06:00
Dongjoon Hyun b709091b4f [SPARK-30228][BUILD] Update zstd-jni to 1.4.4-3
### What changes were proposed in this pull request?

This PR aims to update zstd-jni library to 1.4.4-3.

### Why are the changes needed?

This will bring the latest bug fixes in zstd itself and some performance improvements.
- https://github.com/facebook/zstd/releases/tag/v1.4.4

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #26856 from dongjoon-hyun/SPARK-ZSTD-144.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-12 14:16:32 +09:00
Yuming Wang eb509968a7 [SPARK-30211][INFRA] Use python3 in make-distribution.sh
### What changes were proposed in this pull request?

This PR switches python to python3 in `make-distribution.sh`.

### Why are the changes needed?

SPARK-29672 changed this
- https://github.com/apache/spark/pull/26330/files#diff-8cf6167d58ce775a08acafcfe6f40966

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26844 from wangyum/SPARK-30211.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 23:30:12 -08:00
Takeshi Yamamuro be867e8a9e [SPARK-30196][BUILD] Bump lz4-java version to 1.7.0
### What changes were proposed in this pull request?

This PR intends to upgrade lz4-java from 1.6.0 to 1.7.0.

### Why are the changes needed?

This release includes a performance bug fix (https://github.com/lz4/lz4-java/pull/143) by JoshRosen and some improvements (e.g., an LZ4 binary update). You can see the link below for the changes:
https://github.com/lz4/lz4-java/blob/master/CHANGES.md#170

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26823 from maropu/LZ4_1_7_0.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-10 12:22:03 +09:00
Dongjoon Hyun afc4fa02bd [SPARK-30156][BUILD] Upgrade Jersey from 2.29 to 2.29.1
### What changes were proposed in this pull request?

This PR aims to upgrade `Jersey` from 2.29 to 2.29.1.

### Why are the changes needed?

This will bring several bug fixes and important dependency upgrades.
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.29.1.html

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #26785 from dongjoon-hyun/SPARK-30156.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 18:49:43 -08:00
Dongjoon Hyun 1e0037b5e9 [SPARK-30157][BUILD][TEST-HADOOP3.2][TEST-JAVA11] Upgrade Apache HttpCore from 4.4.10 to 4.4.12
### What changes were proposed in this pull request?

This PR aims to upgrade `Apache HttpCore` from 4.4.10 to 4.4.12.

### Why are the changes needed?

`Apache HttpCore v4.4.11` is the first official release for JDK11.
> This is a maintenance release that corrects a number of defects in non-blocking SSL session code that caused compatibility issues with TLSv1.3 protocol implementation shipped with Java 11.

For the full release note, please see the following.
- https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #26786 from dongjoon-hyun/SPARK-30157.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-07 10:59:10 +09:00
Dongjoon Hyun 1595e46a4e [SPARK-30142][TEST-MAVEN][BUILD] Upgrade Maven to 3.6.3
### What changes were proposed in this pull request?

This PR aims to upgrade Maven from 3.6.2 to 3.6.3.

### Why are the changes needed?

This will bring bug fixes like the following.
- MNG-6759 Maven fails to use <repositories> section from dependency when resolving transitive dependencies in some cases
- MNG-6760 ExclusionArtifactFilter result invalid when wildcard exclusion is followed by other exclusions

The following is the full release note.
- https://maven.apache.org/docs/3.6.3/release-notes.html

### Does this PR introduce any user-facing change?

No. (This is a dev-environment change.)

### How was this patch tested?

Pass the Jenkins with both SBT and Maven.

Closes #26770 from dongjoon-hyun/SPARK-30142.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-06 23:41:59 +09:00
Dongjoon Hyun f3abee377d [SPARK-30051][BUILD] Clean up hadoop-3.2 dependency
### What changes were proposed in this pull request?

This PR aims to cut the `org.eclipse.jetty:jetty-webapp` and `org.eclipse.jetty:jetty-xml` transitive dependencies from `hadoop-common`.

### Why are the changes needed?

This will simplify our dependency management by the removal of unused dependencies.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action with all combinations and the Jenkins UT with Hadoop 3.2.

Closes #26742 from dongjoon-hyun/SPARK-30051.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-03 14:33:36 -08:00
Sean Owen 4193d2f4cc [SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13
### What changes were proposed in this pull request?

Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications.

Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that prevent one class from supporting both.

### Why are the changes needed?

To support building for Scala 2.13 in the future.

### Does this PR introduce any user-facing change?

There should be no behavior change.

### How was this patch tested?

Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates.

Closes #26728 from srowen/SPARK-30012.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-03 08:59:43 -08:00
HyukjinKwon 32af7004a2 [SPARK-25016][INFRA][FOLLOW-UP] Remove leftover for dropping Hadoop 2.6 in Jenkins's test script
### What changes were proposed in this pull request?

This PR proposes to remove the leftover. After https://github.com/apache/spark/pull/22615, we don't have Hadoop 2.6 profile anymore in master.

### Why are the changes needed?

Using "test-hadoop2.6" against master branch in a PR wouldn't work.

### Does this PR introduce any user-facing change?

No (dev only).

### How was this patch tested?

Manually tested at https://github.com/apache/spark/pull/26707 and Jenkins build will test.

Without this fix, and hadoop2.6 in the pr title, it shows as below:

```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.6 in the list. Valid options  are dict_keys(['hadoop2.7', 'hadoop3.2'])
Attempting to post to Github...
```

Closes #26708 from HyukjinKwon/SPARK-25016.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-30 12:49:14 +09:00
HyukjinKwon 4a73bed318 [SPARK-29991][INFRA] Support Hive 1.2 and Hive 2.3 (default) in PR builder
### What changes were proposed in this pull request?

Currently, the Apache Spark PR Builder uses `hive-1.2` for `hadoop-2.7` and `hive-2.3` for `hadoop-3.2`. This PR aims to support

- `[test-hive1.2]`  in PR builder
- `[test-hive2.3]` in PR builder to be consistent and independent of the default profile
- After this PR, all PR builders will use Hive 2.3 by default (because Spark uses Hive 2.3 by default as of c98e5eb339)
- Use default profile in AppVeyor build.

Note that this was previously reverted due to an unexpected test failure in `ThriftServerPageSuite`, which was investigated in https://github.com/apache/spark/pull/26706. This PR fixed it by letting the suite use its own forked JVM. There is no explicit evidence for this fix, and it was just my speculation, but thankfully it at least fixed the failure.

### Why are the changes needed?
This new tag allows us more flexibility.

### Does this PR introduce any user-facing change?
No. (This is a dev-only change.)

### How was this patch tested?
Check the Jenkins triggers in this PR.

Default:

```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver -Pmesos -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pkinesis-asl -Pyarn test:package streaming-kinesis-asl-assembly/assembly
```

`[test-hive1.2][test-hadoop3.2]`:

```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-3.2 -Phive-1.2 -Phadoop-cloud -Pyarn -Pspark-ganglia-lgpl -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
```

`[test-maven][test-hive-2.3]`:

```
========================================================================
Building Spark
========================================================================
[info] Building Spark using Maven with these arguments:  -Phadoop-2.7 -Phive-2.3 -Pspark-ganglia-lgpl -Pyarn -Phive -Phadoop-cloud -Pkinesis-asl -Pmesos -Pkubernetes -Phive-thriftserver clean package -DskipTests
```

Closes #26710 from HyukjinKwon/SPARK-29991.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-30 12:48:15 +09:00
HyukjinKwon 9351e3e76f Revert "[SPARK-29991][INFRA] Support test-hive1.2 in PR Builder"
This reverts commit dde0d2fcad.
2019-11-29 13:23:22 +09:00
Dongjoon Hyun dde0d2fcad [SPARK-29991][INFRA] Support test-hive1.2 in PR Builder
### What changes were proposed in this pull request?

Currently, the Apache Spark PR Builder uses `hive-1.2` for `hadoop-2.7` and `hive-2.3` for `hadoop-3.2`. This PR aims to support `[test-hive1.2]` in the PR Builder in order to decouple `hive-1.2/2.3` from `hadoop-2.7/3.2`. After this PR, the PR Builder will use `hive-2.3` by default for all profiles (if there is no `test-hive1.2`).

### Why are the changes needed?

This new tag gives us more flexibility to test Hive 1.2 independently of the Hadoop profile.

### Does this PR introduce any user-facing change?

No. (This is a dev-only change.)

### How was this patch tested?

Check the Jenkins triggers in this PR.

**BEFORE**
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7 -Phive-1.2 -Pyarn -Pkubernetes -Phive -Phadoop-cloud -Pspark-ganglia-lgpl -Phive-thriftserver -Pkinesis-asl -Pmesos test:package streaming-kinesis-asl-assembly/assembly
```

**AFTER**
1. Title: [[SPARK-29991][INFRA][test-hive1.2] Support `test-hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114550/testReport)
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7 -Phive-1.2 -Pkinesis-asl -Phadoop-cloud -Pyarn -Phive -Pmesos -Pspark-ganglia-lgpl -Pkubernetes -Phive-thriftserver test:package streaming-kinesis-asl-assembly/assembly
```

2. Title: [[SPARK-29991][INFRA] Support `test hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114551/testReport)
- Note that I removed the hyphen intentionally from `test-hive1.2`.
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7 -Phive-thriftserver -Pkubernetes -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pmesos -Pyarn -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
```
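
As the hyphen-less title above demonstrates, the tag has to appear literally; a hypothetical sketch of that behavior (not Spark's actual builder code):

```python
def has_tag(title: str, tag: str) -> bool:
    # The bracketed tag must match verbatim; "test hive1.2" is ignored.
    return f"[{tag}]" in title

print(has_tag("[SPARK-29991][INFRA][test-hive1.2] Support ...", "test-hive1.2"))  # True
print(has_tag("[SPARK-29991][INFRA] Support test hive1.2 ...", "test-hive1.2"))   # False
```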

Closes #26695 from dongjoon-hyun/SPARK-29991.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-27 21:14:40 -08:00
Dongjoon Hyun 9459833eae [SPARK-29989][INFRA] Add hadoop-2.7/hive-1.2 pre-built distribution
### What changes were proposed in this pull request?

This PR aims to add another pre-built binary distribution with `-Phadoop-2.7 -Phive-1.2` for `Apache Spark 3.0.0`.

**PRE-BUILT BINARY DISTRIBUTION**
```
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.asc
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.sha512
```
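
As a usage note, the `.sha512` file above lets users verify a download; a minimal sketch, assuming the archive has already been fetched locally:

```python
import hashlib

def sha512_of(path: str, chunk: int = 1 << 20) -> str:
    # Stream the archive so a large distribution need not fit in memory.
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Compare the result against the digest published in the .sha512 file.
print(sha512_of("spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz"))
```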

**CONTENTS (snippet)**
```
$ ls *hadoop-*
hadoop-annotations-2.7.4.jar                hadoop-mapreduce-client-shuffle-2.7.4.jar
hadoop-auth-2.7.4.jar                       hadoop-yarn-api-2.7.4.jar
hadoop-client-2.7.4.jar                     hadoop-yarn-client-2.7.4.jar
hadoop-common-2.7.4.jar                     hadoop-yarn-common-2.7.4.jar
hadoop-hdfs-2.7.4.jar                       hadoop-yarn-server-common-2.7.4.jar
hadoop-mapreduce-client-app-2.7.4.jar       hadoop-yarn-server-web-proxy-2.7.4.jar
hadoop-mapreduce-client-common-2.7.4.jar    parquet-hadoop-1.10.1.jar
hadoop-mapreduce-client-core-2.7.4.jar      parquet-hadoop-bundle-1.6.0.jar
hadoop-mapreduce-client-jobclient-2.7.4.jar

$ ls *hive-*
hive-beeline-1.2.1.spark2.jar                   hive-jdbc-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar                       hive-metastore-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar                      spark-hive-thriftserver_2.12-3.0.0-SNAPSHOT.jar
```

### Why are the changes needed?

Since Apache Spark switched to `-Phive-2.3` by default, all pre-built binary distributions will use `-Phive-2.3`. This PR adds a `hadoop-2.7/hive-1.2` distribution to provide a combination similar to the `Apache Spark 2.4` line.

### Does this PR introduce any user-facing change?

Yes. This is an additional distribution that resembles the `Apache Spark 2.4` line in terms of the `hive` version.

### How was this patch tested?

Manual.

Please note that we need a dry-run mode, but the current release script does not generate additional combinations (including this one) in `dry-run` mode.

Closes #26688 from dongjoon-hyun/SPARK-29989.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-11-27 15:55:52 -08:00
Dongjoon Hyun 7c0ce28501 [SPARK-30056][INFRA] Skip building test artifacts in dev/make-distribution.sh
### What changes were proposed in this pull request?

This PR aims to skip building test artifacts in `dev/make-distribution.sh`.
Since Apache Spark 3.0.0, we need to build additional binary distributions; skipping test artifacts speeds up the release process when building multiple distributions.

### Why are the changes needed?

Since the test jars are irrelevant to the generated binary distribution, we can skip building them.

**BEFORE**
```
$ time dev/make-distribution.sh
      726.86 real      2526.04 user        45.63 sys
```

**AFTER**
```
$ time dev/make-distribution.sh
      305.54 real      1099.99 user        26.52 sys
```
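
A throwaway calculation over the two `time` runs above, just to quantify the gain:

```python
before, after = 726.86, 305.54  # "real" (wall-clock) seconds from above
print(f"{(1 - after / before) * 100:.0f}% less wall-clock time")  # ~58%
```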

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually check `dev/make-distribution.sh` result and time.

Closes #26689 from dongjoon-hyun/SPARK-30056.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-27 18:19:21 +09:00
Dongjoon Hyun bdf0c606b6 [SPARK-28752][BUILD][FOLLOWUP] Fix to install rouge instead of rogue
### What changes were proposed in this pull request?

This PR aims to fix a typo: `rogue` -> `rouge`.
This is a follow-up of https://github.com/apache/spark/pull/26521.

### Why are the changes needed?

To support `Python 3`, we upgraded from `pygments` to `rouge`.

### Does this PR introduce any user-facing change?

No. (This is for only document generation.)

### How was this patch tested?

Manually.
```
$ docker build -t test dev/create-release/spark-rm/
...
1 gem installed
Successfully installed rouge-3.13.0
Parsing documentation for rouge-3.13.0
Installing ri documentation for rouge-3.13.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container 9bd8707d9e84
 ---> a18b2f6b0bb9
...
```

Closes #26686 from dongjoon-hyun/SPARK-28752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-26 14:58:20 -08:00
Sean Owen e23c135e56 [SPARK-29293][BUILD] Move scalafmt to Scala 2.12 profile; bump to 0.12
### What changes were proposed in this pull request?

Move scalafmt to Scala 2.12 profile; bump to 0.12.

### Why are the changes needed?

To facilitate a future Scala 2.13 build.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

This isn't covered by tests; it's a convenience for contributors.

Closes #26655 from srowen/SPARK-29293.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-26 09:59:19 -08:00
Dongjoon Hyun c2d513f8e9 [SPARK-30035][BUILD] Upgrade to Apache Commons Lang 3.9
### What changes were proposed in this pull request?

This PR aims to upgrade to `Apache Commons Lang 3.9`.

### Why are the changes needed?

`Apache Commons Lang 3.9` is the first official release to support JDK9+. The following is the full release note.
- https://commons.apache.org/proper/commons-lang/release-notes/RELEASE-NOTES-3.9.txt

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26672 from dongjoon-hyun/SPARK-30035.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-26 21:31:02 +09:00
Dongjoon Hyun 53e19f3678 [SPARK-30032][BUILD] Upgrade to ORC 1.5.8
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache ORC 1.5.8.

### Why are the changes needed?

This will bring the latest bug fixes. The following is the full release note.
- https://issues.apache.org/jira/projects/ORC/versions/12346462

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26669 from dongjoon-hyun/SPARK-ORC-1.5.8.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-25 20:08:11 -08:00
Dongjoon Hyun a1706e2fa7 [SPARK-30005][INFRA] Update test-dependencies.sh to check hive-1.2/2.3 profile
### What changes were proposed in this pull request?

This PR aims to update `test-dependencies.sh` to validate all available `Hadoop/Hive` combinations.

### Why are the changes needed?

Previously, we checked only `Hadoop2.7/Hive1.2` and `Hadoop3.2/Hive2.3`.
For Apache Spark 3.0, we need to additionally validate `Hadoop2.7/Hive2.3`.
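
At its core, such a dependency check is a set comparison between the checked-in manifest and a freshly resolved dependency list; a minimal sketch with assumed file paths (not the actual `test-dependencies.sh` logic):

```python
def manifest_diff(expected_path: str, actual_path: str):
    # Each line names one artifact; compare as sets to ignore ordering.
    with open(expected_path) as f:
        expected = set(f.read().split())
    with open(actual_path) as f:
        actual = set(f.read().split())
    return sorted(expected - actual), sorted(actual - expected)

# Paths below are illustrative placeholders.
missing, unexpected = manifest_diff("dev/deps/spark-deps-hadoop-2.7-hive-2.3",
                                    "/tmp/resolved-deps.txt")
```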

### Does this PR introduce any user-facing change?

No. (This is a dev-only change).

### How was this patch tested?

Pass the GitHub Action (Linter) with the newly updated manifest, because this is only a dependency check.

Closes #26646 from dongjoon-hyun/SPARK-30005.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-24 10:14:02 -08:00
Dongjoon Hyun a60da23d64 [SPARK-30007][INFRA] Publish snapshot/release artifacts with -Phive-2.3 only
### What changes were proposed in this pull request?

This PR aims to add `-Phive-2.3` to the publish profiles.
Since Apache Spark 3.0.0, Maven artifacts will be published with the Apache Hive 2.3 profile only.

This PR will also recover the `SNAPSHOT` publishing Jenkins job.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/

We will still provide the pre-built distributions (with Hive 1.2.1 as well), like the Apache Spark 2.4 line.
SPARK-29989 will update the release script to generate all combinations.

### Why are the changes needed?

This will reduce the explicit dependency on the illegitimate Hive fork in the Maven repository.

### Does this PR introduce any user-facing change?

Yes, but these are dev-only changes.

### How was this patch tested?

Manual.

Closes #26648 from dongjoon-hyun/SPARK-30007.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-23 22:34:21 -08:00
Dongjoon Hyun c98e5eb339 [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
### What changes were proposed in this pull request?

This PR aims to do the following:
- Add two profiles, `hive-1.2` and `hive-2.3` (default).
- Validate that we at least keep the existing combinations (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3).

For now, we assume that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The following items are beyond the scope of this PR:

- SPARK-29988 Adjust Jenkins jobs for `hive-1.2/2.3` combination
- SPARK-29989 Update release-script for `hive-1.2/2.3` combination
- SPARK-29991 Support `hive-1.2/2.3` in PR Builder

### Why are the changes needed?

This will help us switch Hive versions and thereby update the exposed dependencies.

### Does this PR introduce any user-facing change?

This is a dev-only change; the build profile combinations change as follows:
- `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2`
- `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3`
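
The assumed pairing can be read as a simple lookup; a sketch mirroring the list above (illustrative only, not a Spark API):

```python
# Hadoop/Hive profile pairings assumed by this change; hive-2.3 is the
# new default.
COMBINATIONS = {
    "hadoop-2.7": "hive-1.2",
    "hadoop-3.2": "hive-2.3",
}

def build_profiles(hadoop: str) -> str:
    return f"-P{hadoop} -P{COMBINATIONS[hadoop]}"

print(build_profiles("hadoop-2.7"))  # -Phadoop-2.7 -Phive-1.2
```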

### How was this patch tested?

Pass the Jenkins with the dependency check and tests to make sure we don't change anything for now.

- [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)
- [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)

Also, from now, GitHub Action validates the following combinations.
![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png)

Closes #26619 from dongjoon-hyun/SPARK-29981.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-23 10:02:22 -08:00