Commit graph

989 commits

Author SHA1 Message Date
Dongjoon Hyun 290aa02179 [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work
### What changes were proposed in this pull request?

This reverts commit SPARK-33212 (cb3fa6c936) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.

### Why are the changes needed?

According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.

**1. Spark distribution with `-Phadoop-cloud`**

```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```

**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 18:23:48 +09:00
Weichen Xu 80161238fe [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-01 09:36:42 +08:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
HyukjinKwon ed9e6fc182 [SPARK-33565][INFRA][FOLLOW-UP] Keep the test coverage with Python 3.8 in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue.

**Before this change in GitHub Actions:**

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'pypy3']
...
```

**After this change in GitHub Actions:**

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'python3.8', 'pypy3']
```

### Why are the changes needed?

To keep the test coverage with Python 3.8 in GitHub Actions.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions in this build will test.

Closes #30510 from HyukjinKwon/SPARK-33565.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-26 11:42:12 +09:00
Yuming Wang 1de3fc4282 [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
### What changes were proposed in this pull request?

We supported Hive metastore are 0.12.0 through 3.1.2, but we supported hive-jdbc are 0.12.0 through 2.3.7. It will throw `TProtocolException` if we use hive-jdbc 3.x:

```
[rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet;
Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable.
Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
```
```
org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
```

This pr upgrade hive-service-rpc to 3.1.2 to fix this issue.

### Why are the changes needed?

To support hive-jdbc 3.x.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test:
```
[rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet;
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.051 seconds)
0: jdbc:hive2://localhost:10000/default> insert into t1 values(1);
+---------+
| Result  |
+---------+
+---------+
No rows selected (2.08 seconds)
0: jdbc:hive2://localhost:10000/default> select * from t1;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.605 seconds)
```

Closes #30478 from wangyum/SPARK-33525.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-25 12:37:59 -08:00
yangjie01 048a9821c7
[SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
### What changes were proposed in this pull request?
It seems that Jenkins tests tasks in many pr have test failed. The failed cases include:

-  `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 get binary type`

The error message as follows:

```
Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("Stacktracesbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
```

But they can pass the GitHub Action, maybe it's related to the `LANG` of the Jenkins build machine, this pr add `export LANG="en_US.UTF-8"` in `run-test-jenkins` script.

### Why are the changes needed?
Ensure LANG in Jenkins test process is `en_US.UTF-8` to pass `HIVE_CLI_SERVICE_PROTOCOL_VX` related tests

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Jenkins tests pass

Closes #30487 from LuciferYang/SPARK-33535.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-24 09:50:10 -08:00
William Hyun 84e70362db
[SPARK-33510][BUILD] Update SBT to 1.4.4
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.2 to 1.4.4.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.3
- https://github.com/sbt/sbt/releases/tag/v1.4.4

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #30453 from williamhyun/sbt143.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-22 22:56:59 -08:00
William Hyun a459238523 [MINOR][INFRA] Suppress warning in check-license
### What changes were proposed in this pull request?
This PR aims to suppress the warning `File exists` in check-license

### Why are the changes needed?

**BEFORE**
```
% dev/check-license
Attempting to fetch rat
RAT checks passed.

% dev/check-license
mkdir: target: File exists
RAT checks passed.
```

**AFTER**
```
% dev/check-license
Attempting to fetch rat
RAT checks passed.

% dev/check-license
RAT checks passed.
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually do dev/check-license twice.

Closes #30460 from williamhyun/checklicense.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-23 10:38:40 +09:00
Dongjoon Hyun d5e7bd0cc4
[SPARK-33483][INFRA][TESTS] Fix rat exclusion patterns and add a LICENSE
### What changes were proposed in this pull request?

This PR fixes the RAT exclusion rule which was originated from SPARK-1144 (Apache Spark 1.0)

### Why are the changes needed?

This prevents the situation like https://github.com/apache/spark/pull/30415.

Currently, it missed `catalog` directory due to `.log` rule.
```
$ dev/check-license
Could not find Apache license headers in the following files:
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI with the new rule.

Closes #30418 from dongjoon-hyun/SPARK-RAT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-18 23:59:11 -08:00
Takeshi Yamamuro 74bd046d17 [SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1
### What changes were proposed in this pull request?

This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1.

### Why are the changes needed?

Release note of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug fixes for java targets):
 - v4.8: https://github.com/antlr/antlr4/releases/tag/4.8
 - v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA tests.

Closes #30404 from maropu/UpgradeAntlr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-18 21:20:28 +09:00
Rameshkrishnan Muthusamy 5e8549973d
[SPARK-33471][K8S][BUILD] Upgrade kubernetes-client to 4.12.0
### What changes were proposed in this pull request?

This PR aims to upgrade Kubernetes-client from 4.11.1 to 4.12.0

### Why are the changes needed?

This upgrades the dependency for Apache Spark 3.1.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30401 from ramesh-muthusamy/SPARK-33471-k8s-clientupgrade.

Authored-by: Rameshkrishnan Muthusamy <rameshkrishnan_muthusamy@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-17 13:41:58 -08:00
Chao Sun c2caf2522b [SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0
### What changes were proposed in this pull request?

This upgrade Apache Arrow version from 1.0.1 to 2.0.0

### Why are the changes needed?

Apache Arrow 2.0.0 was released with some improvements from Java side, so it's better to upgrade Spark to the new version.
Note that the format version in Arrow 2.0.0 is still 1.0.0 so API should still be compatible between 1.0.1 and 2.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #30306 from sunchao/SPARK-33213.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-09 19:07:16 -08:00
Dongjoon Hyun 35ac314181 [SPARK-33405][BUILD] Upgrade commons-compress to 1.20
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-compress` from 1.8 to 1.20.

### Why are the changes needed?

- https://commons.apache.org/proper/commons-compress/security-reports.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30304 from dongjoon-hyun/SPARK-33405.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:08:55 +09:00
huangtianhua 83a80796aa [SPARK-32691][BUILD] Update commons-crypto to v1.1.0
### What changes were proposed in this pull request?
Update the package commons-crypto to v1.1.0 to support aarch64 platform
- https://issues.apache.org/jira/browse/CRYPTO-139

### Why are the changes needed?

The package commons-crypto-1.0.0 available in the Maven repository
doesn't support aarch64 platform. It costs long time in
CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv) when NettyBlockRpcSever
receive block data from client,  if the time more than the default value 120s, IOException raised and client
will retry replicate the block data to other executors. But in fact the replication is complete,
it makes the replication number incorrect.
This makes DistributedSuite tests pass.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Pass the CIs.

Closes #30275 from huangtianhua/SPARK-32691.

Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-09 14:33:27 -08:00
Dongjoon Hyun 27d8136934 [SPARK-33324][K8S][BUILD] Upgrade kubernetes-client to 4.11.1
### What changes were proposed in this pull request?

This PR aims to upgrade `Kubernetes-client` from 4.10.3 to 4.11.1.

### Why are the changes needed?

This upgrades the dependency for Apache Spark 3.1.0.
Since 4.12.0 is still new and has a breaking API changes, this PR chooses the latest compatible one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the all CIs including K8s IT.

Closes #30233 from dongjoon-hyun/SPARK-33324.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 22:23:26 -08:00
HyukjinKwon 9818f079aa [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency
### What changes were proposed in this pull request?

This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.

- **Before:**
    ...
    ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
    ...
    ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
    ...

- **After:**

    ...
    ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
    ...

See also https://numpydoc.readthedocs.io/en/latest/format.html

### Why are the changes needed?

There are many reasons for switching to NumPy documentation style.

1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structures and syntax.

2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`.

3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas,
matplotlib, ... Using NumPy documentation style can give users a consistent documentation style.

### Does this PR introduce _any_ user-facing change?

The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.

### How was this patch tested?

Manually tested via running `cd python` and `make clean html`.

Closes #30149 from HyukjinKwon/SPARK-33243.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 14:03:57 +09:00
Emi ce0ebf5f02 [SPARK-33234][INFRA] Generates SHA-512 using shasum
### What changes were proposed in this pull request?

I am generating the SHA-512 using the standard shasum which also has a better output compared to GPG.

### Why are the changes needed?

Which makes the hash much easier to verify for users that don't have GPG.

Because an user having GPG can check the keys but an user without GPG will have a hard time validating the SHA-512 based on the 'pretty printed' format.

Apache Spark is the only project where I've seen this format. Most other Apache projects have a one-line hash file.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This patch assumes the build system has shasum (it should, but I can't test this).

Closes #30123 from emilianbold/master.

Authored-by: Emi <emilian.bold@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-25 17:06:06 -07:00
Chao Sun cb3fa6c936 [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side affect from this is we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach by switching to shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #29843 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-10-22 03:21:34 +00:00
Takeshi Yamamuro 1b7367ccd7 [SPARK-33205][BUILD] Bump snappy-java version to 1.1.8
### What changes were proposed in this pull request?

This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.

### Why are the changes needed?

For performance improvements; the released `snappy-java` bundles the latest `Snappy` v1.1.8 binaries with small performance improvements.
 - snappy-java release note: https://github.com/xerial/snappy-java/releases/tag/1.1.8
 - snappy release note: https://github.com/google/snappy/releases/tag/1.1.8

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA tests.

Closes #30120 from maropu/Snappy1.1.8.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-21 13:04:39 -07:00
Fokko Driesprong 6ad75cda1e [SPARK-17333][PYSPARK] Enable mypy
### What changes were proposed in this pull request?

Add MyPy to the CI. Once this is installed on the CI: https://issues.apache.org/jira/browse/SPARK-32797?jql=project%20%3D%20SPARK%20AND%20text%20~%20mypy this wil automatically check the types.

### Why are the changes needed?

We should check if the types are still correct on the CI.

```
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

The sphinx-build command was not found. Skipping Sphinx build for now.

all lint-python tests passed!
```

### Does this PR introduce _any_ user-facing change?

No :)

### How was this patch tested?

By running `./dev/lint-python` locally.

Closes #30088 from Fokko/SPARK-17333.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-19 12:50:01 -07:00
William Hyun 53783e706d [SPARK-33179][TESTS] Switch default Hadoop profile in run-tests.py
### What changes were proposed in this pull request?

This PR aims to switch the default Hadoop profile from `hadoop2.7` to `hadoop3.2` in `dev/run-tests.py` when it's running in local or GitHub Action environments.

### Why are the changes needed?

The default Hadoop version is 3.2. We had better be consistent.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually.

**BEFORE**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop2.7 and Hive profile hive2.3 under environment local
```

**AFTER**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
```

Closes #30090 from williamhyun/SPARK-33179.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 15:54:52 +09:00
William Hyun e6c53c2c1b [SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build
### What changes were proposed in this pull request?

This PR aims to ignore GitHub only changes in Amplab Jenkins build.

### Why are the changes needed?

This will save server resources.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Manually. I used the following doctest during testing and removed it at the clean-up.

E2E tests:

```
cd dev
cat test.py
```

```python
import importlib
runtests = importlib.import_module("run-tests")
print([x.name for x in runtests.determine_modules_for_files([".github/workflows/build_and_test.yml"])])
```

```python
$ GITHUB_ACTIONS=1 python test.py
['root']
$ python test.py
[]
```

Unittests:

```bash
$ GITHUN_ACTIONS=1 python3 -m doctest dev/run-tests.py
$ python3 -m doctest dev/run-tests.py
```

Closes #30020 from williamhyun/SPARK-33123.

Lead-authored-by: William Hyun <williamhyun3@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 14:13:37 +09:00
Kent Yao 2507301705 [SPARK-33159][SQL] Use hive-service-rpc as dependency instead of inlining the generated code
### What changes were proposed in this pull request?

Hive's `hive-service-rpc` module started since hive-2.1.0 and it contains only the thrift IDL file and the code generated by it.

Removing the inlined code will help maintain and upgrade builtin hive versions

### Why are the changes needed?

to simply the code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing CI

Closes #30055 from yaooqinn/SPARK-33159.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-10-16 09:37:54 -07:00
Dongjoon Hyun 9896288b88 [SPARK-33117][BUILD] Update zstd-jni to 1.4.5-6
### What changes were proposed in this pull request?

This PR aims to upgrade ZStandard library for Apache Spark 3.1.0.

### Why are the changes needed?

This will bring the latest bug fixes.
- 2662fbdc32
- bbe140b758

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30010 from dongjoon-hyun/SPARK-33117.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-12 00:27:53 -07:00
zero323 018811f974 [SPARK-33105][INFRA] Change default R arch from i386 to x64 and parametrize BINPREF
### What changes were proposed in this pull request?

- Change default R `arch` from `i386` to `x64`, to match Rtools  version.
- Parameterize `BINPREF` with `WIN` (https://stackoverflow.com/a/44035904)

Reported on dev:

http://apache-spark-developers-list.1001551.n3.nabble.com/Broken-rlang-installation-on-AppVeyor-td30294.html

### Why are the changes needed?

It seems like update from rlang 0.4.7 to 0.4.8 exposed an issue, where build fails because of incompatible ddl

```
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
cannot find -lR
[00:01:52] collect2.exe: error: ld returned 1 exit status
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29991 from zero323/APPVEYOR-DEAFAULT-ARCH.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-10 13:48:26 +09:00
HyukjinKwon 2e07ed3041 [SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script
### What changes were proposed in this pull request?

This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script.

- `test-hive1.2` title is not used anymore in Jenkins
- Remove some comments related to Hive 1.2
- Remove unused codes in `OrcFilters.scala`  Hive
- Test `spark.sql.hive.convertMetastoreOrc` disabled case for the tests added at SPARK-19809 and SPARK-22267

### Why are the changes needed?

To remove unused codes & improve test coverage

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually ran the unit tests. Also It will be tested in CI in this PR.

Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-09 03:04:26 -07:00
Dongjoon Hyun 008a2ad1f8 [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1)
### What changes were proposed in this pull request?

As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency to the forked Hive 1.2.1 in maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.

This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```

For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it.

### Why are the changes needed?

- First, Apache Spark community should not use the unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released at 2015-06-26 and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache because the forked `1.2.1.spark2` is not maintained at all. Apache Hive 2.3.0 was released at 2017-07-19 and it has been used with less number of bugs compared with `1.2.1.spark2`. Many bugs still exist in `hive-1.2` profile and new Apache Spark unit tests are added with `HiveUtils.isHive23` condition so far.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.

### How was this patch tested?

1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This has not been supported already due to Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected)

Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-05 15:29:56 -07:00
Dongjoon Hyun aa6657981a [SPARK-33050][BUILD] Upgrade Apache ORC to 1.5.12
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.5.12.

### Why are the changes needed?

This brings us the latest bug patches like the followings.
- ORC-644 nested struct evolution does not respect to orc.force.positional.evolution
- ORC-667 Positional mapping for nested struct types should not applied by default

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #29930 from dongjoon-hyun/SPARK-ORC-1.5.12.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-02 00:06:03 -07:00
Dongjoon Hyun 9c618b3308 [SPARK-33047][BUILD] Upgrade hive-storage-api to 2.7.2
### What changes were proposed in this pull request?

This PR aims to upgrade Apache Hive `hive-storage-api` library from 2.7.1 to 2.7.2.

### Why are the changes needed?

[storage-api 2.7.2](https://github.com/apache/hive/commits/rel/storage-release-2.7.2/storage-api) has the following extension and can be used when users uses a provided orc dependency.

[HIVE-22959](dade9919d9 (diff-ccfc9dd7584117f531322cda3a29f3c3)) : Extend storage-api to expose FilterContext
[HIVE-23215](361925d2f3 (diff-ccfc9dd7584117f531322cda3a29f3c3)) : Make FilterContext and MutableFilterContext interfaces

### Does this PR introduce _any_ user-facing change?

Yes. This is a dependency change.

### How was this patch tested?

Pass the existing tests.

Closes #29923 from dongjoon-hyun/SPARK-33047.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-01 12:41:40 -07:00
Peter Toth 28ed3a512a [SPARK-32723][WEBUI] Upgrade to jQuery 3.5.1
### What changes were proposed in this pull request?
Upgrade to the latest available version of jQuery (3.5.1).

### Why are the changes needed?
There are some CVE-s reported (CVE-2020-11022, CVE-2020-11023) affecting older versions of jQuery. Although Spark UI is read-only and those CVEs doesn't seem to affect Spark, using the latest version of this library can help to handle vulnerability reports of security scans.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual tests and checked the jQuery 3.5 upgrade guide.

Closes #29902 from peter-toth/SPARK-32723-upgrade-to-jquery-3.5.1.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-30 21:30:17 -07:00
jlafleche d75222dd1b [SPARK-33012][BUILD][K8S] Upgrade fabric8 to 4.10.3
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` library to track fabric8's declared compatibility for k8s 1.18.0:
https://github.com/fabric8io/kubernetes-client#compatibility-matrix

### Why are the changes needed?
According to fabric8, 4.9.2 is incompatible with k8s 1.18.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not tested yet.

Closes #29888 from laflechejonathan/jlf/fabric8Ugprade.

Authored-by: jlafleche <jlafleche@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-30 19:00:18 -07:00
HyukjinKwon 688d016c7a [SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option
### What changes were proposed in this pull request?

This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well).

### Why are the changes needed?

Hive 1.2 is a fork version. We shouldn't promote users to use.

### Does this PR introduce _any_ user-facing change?

Nope, `HIVE_VERSION` and Hive 1.2 are removed but this is new experimental feature in master only.

### How was this patch tested?

Manually tested:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```

Closes #29858 from HyukjinKwon/SPARK-32981.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:49:58 +09:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
Dongjoon Hyun 3c97665dad [SPARK-32981][BUILD] Remove hive-1.2/hadoop-2.7 from Apache Spark 3.1 distribution
### What changes were proposed in this pull request?

Apache Spark 3.0 switches its Hive execution version from 1.2 to 2.3, but it still provides the unofficial forked Hive 1.2 version from our distribution like the following. This PR aims to remove it from Apache Spark 3.1.0 officially while keeping `hive-1.2` profile.
```
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.asc
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.sha512
```

### Why are the changes needed?

The unofficial Hive 1.2.1 fork has many bugs and is not maintained for a long time. We had better not recommend this in the official Apache Spark distribution.

### Does this PR introduce _any_ user-facing change?

There is no user-facing change in the default distribution (Hadoop 3.2/Hive 2.3).

### How was this patch tested?

Manually because this is a change in release script .

Closes #29856 from dongjoon-hyun/SPARK-32981.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-23 15:33:53 -07:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts). And downloads the Spark distribution as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as a R library. For example, `sparkr` script is not packed together.

  PySpark cannot take this approach because PySpark packaging ships relevant executable script together, e.g.) `pyspark` shell.

  If PySpark has a method such as `pyspark.install_spark`, users cannot call it in `pyspark` because `pyspark` already assumes relevant Spark is installed, JVM is launched, etc.

- There looks no way to release that contains different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  The usual way looks either `--install-option` above with hacks or environment variables given my investigation.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested in Mac and Windows (after building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
HyukjinKwon dbc41376de [SPARK-32841][BUILD] Use Apache Hadoop 3.2.0 for PyPI and CRAN
### What changes were proposed in this pull request?

PyPI and CRAN did not change because of the concern about selecting Hadoop and Hive versions.

For PyPI, now there is a PR open at https://github.com/apache/spark/pull/29703
For CRAN, we can already select Hadoop and Hive versions via `SparkR::install.spark`.

### Why are the changes needed?

To keep the default profiles consistent in distributions

### Does this PR introduce _any_ user-facing change?

Yes, the default distributions will use Hadoop 3.2.

### How was this patch tested?

Jenkins tests.

Closes #29704 from HyukjinKwon/SPARK-32058.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-10 00:11:07 -07:00
Bryan Cutler e0538bd38c [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.

This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.

Version 1.0.0 - 1.0.0 include the following selected fixes/improvements relevant to Spark users:

ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults

View release notes here:
https://arrow.apache.org/release/1.0.1.html
https://arrow.apache.org/release/1.0.0.html

### Why are the changes needed?

Upgrade brings fixes, improvements and stability guarantees.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests with pyarrow 1.0.0 and 1.0.1

Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-10 14:16:19 +09:00
HyukjinKwon 794b48c172 [SPARK-32204][SPARK-32182][DOCS][FOLLOW-UP] Use IPython instead of ipython to check if installed in dev/lint-python
### What changes were proposed in this pull request?

It should check `IPython` as it's imported as a package. Currently, Sphinx build is being skipped in GitHub Actions as below:

https://github.com/apache/spark/runs/1084164546

```
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

python3 does not have ipython installed. Skipping Sphinx build for now.

all lint-python tests passed!
```

### Why are the changes needed?

To run the documentation builds in Github Actions.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Manually tested as `dev/lint-python`.

Closes #29679 from HyukjinKwon/follow-ipython.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-09-09 12:22:13 +08:00
Yuming Wang c8f56eb7bb [MINOR] Fix usage print to guide pip3 to install jira-python library
### What changes were proposed in this pull request?

Use `pip3` to install `jira-python` library.

### Why are the changes needed?

We use python3 in the build script and `pip` is missing by default.

```
yumwangLM-SHC-16508156 spark % sw_vers
ProductName:	Mac OS X
ProductVersion:	10.15.6
BuildVersion:	19G2021

yumwangLM-SHC-16508156 spark % sudo pip install jira
Password:
sudo: pip: command not found
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test:
```
yumwangLM-SHC-16508156 spark % sudo pip3 install jira
WARNING: The directory '/Users/yumwang/Library/Caches/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting jira
  Downloading jira-2.0.0-py2.py3-none-any.whl (57 kB)
     |████████████████████████████████| 57 kB 78 kB/s
Requirement already satisfied: setuptools>=20.10.1 in /usr/local/lib/python3.8/site-packages (from jira) (49.2.0)
Collecting six>=1.10.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting defusedxml
  Downloading defusedxml-0.6.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib[signedtoken]>=1.0.0
  Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
     |████████████████████████████████| 147 kB 306 kB/s
Collecting pbr>=3.0.0
  Downloading pbr-5.5.0-py2.py3-none-any.whl (106 kB)
     |████████████████████████████████| 106 kB 345 kB/s
Collecting requests-toolbelt
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 474 kB/s
Collecting requests-oauthlib>=0.6.1
  Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting requests>=2.10.0
  Downloading requests-2.24.0-py2.py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 521 kB/s
Collecting cryptography; extra == "signedtoken"
  Downloading cryptography-3.1-cp35-abi3-macosx_10_10_x86_64.whl (1.8 MB)
     |████████████████████████████████| 1.8 MB 497 kB/s
Collecting pyjwt>=1.0.0; extra == "signedtoken"
  Downloading PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting chardet<4,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 671 kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
     |████████████████████████████████| 127 kB 930 kB/s
Collecting certifi>=2017.4.17
  Downloading certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
     |████████████████████████████████| 156 kB 7.1 MB/s
Collecting idna<3,>=2.5
  Downloading idna-2.10-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 1.3 MB/s
Collecting cffi!=1.11.3,>=1.8
  Downloading cffi-1.14.2-cp38-cp38-macosx_10_9_x86_64.whl (176 kB)
     |████████████████████████████████| 176 kB 7.3 MB/s
Collecting pycparser
  Downloading pycparser-2.20-py2.py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 773 kB/s
Installing collected packages: six, defusedxml, pycparser, cffi, cryptography, pyjwt, oauthlib, pbr, chardet, urllib3, certifi, idna, requests, requests-toolbelt, requests-oauthlib, jira
Successfully installed certifi-2020.6.20 cffi-1.14.2 chardet-3.0.4 cryptography-3.1 defusedxml-0.6.0 idna-2.10 jira-2.0.0 oauthlib-3.1.0 pbr-5.5.0 pycparser-2.20 pyjwt-1.7.1 requests-2.24.0 requests-oauthlib-1.3.0 requests-toolbelt-0.9.1 six-1.15.0 urllib3-1.25.10
```

Closes #29628 from wangyum/jira.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-03 01:10:59 +09:00
Fokko Driesprong a1e459ed9f [SPARK-32719][PYTHON] Add Flake8 check missing imports
https://issues.apache.org/jira/browse/SPARK-32719

### What changes were proposed in this pull request?

Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard).

### Why are the changes needed?

To make sure that the quality of the Python code is up to standard.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit-tests and Flake8 static analysis

Closes #29563 from Fokko/fd-add-check-missing-imports.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:23:31 +09:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on Docker image. Once somebody builds, other people can reuse the image against a specific commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give my direct feedback which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change as whatever you want. I can apply as are or slightly update more when I apply to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file with crediting correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users are very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
Sean Owen a9d4e60a90 [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
### What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

### Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

### Does this PR introduce _any_ user-facing change?

Nothing beyond the effect of the bug fix.

### How was this patch tested?

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 00:25:58 +09:00
Takeshi Yamamuro 6dd37cbaac [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test triggers
### What changes were proposed in this pull request?

This PR proposes to add a `workflow_dispatch` entry in the GitHub Action script (`build_and_test.yml`). This update can enable developers to run the Spark tests for a specific branch on their own local repository, so I think it might help to check if al the tests can pass before opening a new PR.

<img width="944" alt="Screen Shot 2020-08-21 at 16 28 41" src="https://user-images.githubusercontent.com/692303/90866249-96250c80-e3ce-11ea-8496-3dd6683e92ea.png">

### Why are the changes needed?

To reduce the pressure of GitHub Actions on the Spark repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #29504 from maropu/DispatchTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-21 21:23:41 +09:00
Wenchen Fan 8b119f1663 [SPARK-32640][SQL] Downgrade Janino to fix a correctness bug
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/27860 to downgrade Janino, as the new version has a bug.

### Why are the changes needed?

The symptom is about NaN comparison. For code below
```
if (double_value <= 0.0) {
  ...
} else {
  ...
}
```

If `double_value` is NaN, `NaN <= 0.0` is false and we should go to the else branch. However, current Spark goes to the if branch and causes correctness issues like SPARK-32640.

One way to fix it is:
```
boolean cond = double_value <= 0.0;
if (cond) {
  ...
} else {
  ...
}
```

I'm not familiar with Janino so I don't know what's going on there.

### Does this PR introduce _any_ user-facing change?

Yes, fix correctness bugs.

### How was this patch tested?

a new test

Closes #29495 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 13:26:39 -07:00
Fokko Driesprong 9fcf0ea718 [SPARK-32319][PYSPARK] Disallow the use of unused imports
Disallow the use of unused imports:

- Unnecessary increases the memory footprint of the application
- Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :)

```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```

After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```

### What changes were proposed in this pull request?

Removing unused imports from the Python files to keep everything nice and tidy.

### Why are the changes needed?

Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adding the rule to the existing Flake8 checks.

Closes #29121 from Fokko/SPARK-32319.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-08 08:51:57 -07:00
Prashant Sharma 6c3d0a4405 [SPARK-32556][INFRA] Fix release script to have urlencoded passwords where required
### What changes were proposed in this pull request?
1. URL encode the `ASF_PASSWORD` of the release manager.
2. Update the image to install `qpdf` and `jq` dep
3. Increase the JVM HEAM memory for maven build.

### Why are the changes needed?
Release script takes hours to run, and if a single failure happens about somewhere midway, then either one has to get down to manually doing stuff or re run the entire script. (This is my understanding) So, I have made the fixes of a few failures, discovered so far.

1. If the release manager password contains a char, that is not allowed in URL, then it fails the build at the clone spark step.
`git clone "https://$ASF_USERNAME:$ASF_PASSWORD$ASF_SPARK_REPO" -b $GIT_BRANCH`

          ^^^ Fails with bad URL

`ASF_USERNAME` may not be URL encoded, but we need to encode `ASF_PASSWORD`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the release for branch-2.4, using both type of passwords, i.e. passwords with special chars and without it.

Closes #29373 from ScrapCodes/release-script-fix2.

Lead-authored-by: Prashant Sharma <prashant@apache.org>
Co-authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Prashant Sharma <prashant@apache.org>
2020-08-07 03:31:45 -05:00
Terry Kim 171b7d5d71 [SPARK-23431][CORE] Expose stage level peak executor metrics via REST API
### What changes were proposed in this pull request?

Note that this PR is forked from #23340 originally written by edwinalu.

This PR proposes to expose the peak executor metrics at the stage level via the REST APIs:
* `/applications/<application_id>/stages/`: peak values of executor metrics for **each stage**
* `/applications/<application_id>/stages/<stage_id>/< stage_attempt_id >`: peak values of executor metrics for **each executor** for the stage, followed by peak values of executor metrics for the stage

### Why are the changes needed?

The stage level peak executor metrics can help better understand your application's resource utilization.

### Does this PR introduce _any_ user-facing change?

1. For the `/applications/<application_id>/stages/` API, you will see the following new info for **each stage**:
```JSON
  "peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

2. For the `/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>` API, you will see the following new info for **each executor** under `executorSummary`:
```JSON
  "peakMemoryMetrics" : {
    "JVMHeapMemory" : 0,
    "JVMOffHeapMemory" : 0,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 0,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 0,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 0,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 0,
    "MinorGCTime" : 0,
    "MajorGCCount" : 0,
    "MajorGCTime" : 0
  }
```
, and the following at the stage level:
```JSON
"peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

### How was this patch tested?

Added tests.

Closes #29020 from imback82/metrics.

Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: edwinalu <edwina.lu@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-04 21:11:00 +08:00
yangjie01 0693d8bbf2 [SPARK-32490][BUILD] Upgrade netty-all to 4.1.51.Final
### What changes were proposed in this pull request?
This PR aims to bring the bug fixes from the latest netty version.

### Why are the changes needed?
- 4.1.48.Final: [https://github.com/netty/netty/milestone/223?closed=1](https://github.com/netty/netty/milestone/223?closed=1)(14 patches or issues)
- 4.1.49.Final: [https://github.com/netty/netty/milestone/224?closed=1](https://github.com/netty/netty/milestone/224?closed=1)(48 patches or issues)
- 4.1.50.Final: [https://github.com/netty/netty/milestone/225?closed=1](https://github.com/netty/netty/milestone/225?closed=1)(38 patches or issues)
- 4.1.51.Final: [https://github.com/netty/netty/milestone/226?closed=1](https://github.com/netty/netty/milestone/226?closed=1)(53 patches or issues)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins with the existing tests.

Closes #29299 from LuciferYang/upgrade-netty-version.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-02 16:46:11 -07:00
HyukjinKwon 32f4ef005f [SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub Actions
### What changes were proposed in this pull request?

CRAN check fails due to the size of the generated PDF docs as below:

```
...
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
...
Status: 1 WARNING, 1 NOTE
See
  ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
for details.
```

This PR proposes to install `qpdf` in GitHub Actions.

Note that I cannot reproduce in my local with the same R version so I am not documenting it for now.

Also, while I am here, I piggyback to install SparkR when the module includes `sparkr`. it is rather a followup of SPARK-32491.

### Why are the changes needed?

To fix SparkR CRAN check failure.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions will test it out.

Closes #29306 from HyukjinKwon/SPARK-32497.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-31 00:57:24 +09:00
HyukjinKwon 12f443cd99 [SPARK-32496][INFRA] Include GitHub Action file as the changes in testing
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/26556 excluded `.github/workflows/master.yml`. So tests are skipped if the GitHub Actions configuration file is changed.

As of SPARK-32245, we now run the regular tests via the testing script. We should include it to test to make sure GitHub Actions build does not break due to some changes such as Python versions.

### Why are the changes needed?

For better test coverage in GitHub Actions build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions in this PR will test.

Closes #29305 from HyukjinKwon/SPARK-32496.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 21:07:31 +09:00
HyukjinKwon 1f7fe5415e [SPARK-32491][INFRA] Do not install SparkR in test-only mode in testing script
### What changes were proposed in this pull request?

This PR proposes to skip SparkR installation that is to run R linters (see SPARK-8505) in the test-only mode at `dev/run-tests.py` script.

As of SPARK-32292, the test-only mode in `dev/run-tests.py` was introduced, for example:

```
dev/run-tests.py --modules sql,core
```

which only runs the relevant tests and does not run other tests such as linters. Therefore, we don't need to install SparkR when `--modules` are specified.

### Why are the changes needed?

GitHub Actions build is currently failed as below:

```
ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
[error] running /home/runner/work/spark/spark/R/install-dev.sh ; received return code 1
##[error]Process completed with exit code 10.
```

For some reasons, looks GitHub Actions started to have R 3.4.4 installed by default; however, R 3.4 was dropped as of SPARK-32073.  When SparkR tests are not needed, GitHub Actions still builds SparkR with a low R version and it causes the test failure.

This PR partially fixes it by avoid the installation of SparkR.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions tests should run to confirm this fix is correct.

Closes #29300 from HyukjinKwon/install-r.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 19:28:34 +09:00
HyukjinKwon a82aee0441 [SPARK-32435][PYTHON] Remove heapq3 port from Python 3
### What changes were proposed in this pull request?

This PR removes the manual port of `heapq3.py` introduced from SPARK-3073. The main reason of this was to support Python 2.6 and 2.7 because Python 2's `heapq.merge()` doesn't not support `key` and `reverse`.

See
- https://docs.python.org/2/library/heapq.html#heapq.merge in Python 2
- https://docs.python.org/3.8/library/heapq.html#heapq.merge in Python 3

Since we dropped the Python 2 at SPARK-32138, we can remove this away.

### Why are the changes needed?

To remove unnecessary codes. Also, we can leverage bug fixes made in Python 3.x at `heapq`.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Existing tests should cover. I locally ran and verified:

```bash
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle"
./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter"
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key"
```

Closes #29229 from HyukjinKwon/SPARK-32435.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 20:10:13 +09:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage.

### Why are the changes needed?

Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00
Liang-Chi Hsieh 70ac594bb3 [SPARK-32450][PYTHON] Upgrade pycodestyle to v2.6.0
### What changes were proposed in this pull request?

This patch upgrades pycodestyle from v2.4.0 to v2.6.0. The changes at each release:

2.5.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id3
2.6.0a1: https://pycodestyle.pycqa.org/en/latest/developer.html#a1-2020-04-23
2.6.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id2

Changes: Dropped Python 2.6 and 3.3 support, added Python 3.7 and 3.8 support...

### Why are the changes needed?

Including bug fixes and newer Python version support.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Ran `dev/lint-python` locally.

Closes #29249 from viirya/upgrade-pycodestyle.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 10:43:32 +09:00
Dongjoon Hyun 83ffef7ffb [SPARK-32441][BUILD][CORE] Update json4s to 3.7.0-M5 for Scala 2.13
### What changes were proposed in this pull request?

This PR aims to upgrade `json4s` to from 3.6.6 to 3.7.0-M5 for Scala 2.13 support at Apache Spark 3.1.0 on December. We will upgrade to the latest `json4s` around November.

### Why are the changes needed?

`json4s` starts to support Scala 2.13 since v3.7.0-M4.
- https://github.com/json4s/json4s/issues/660
- b013af8e75

Old `json4s` causes many UT failures with `NoSuchMethodException`.
```scala
 Cause: java.lang.NoSuchMethodException: scala.collection.immutable.Seq$.apply(scala.collection.Seq)
  at java.lang.Class.getMethod(Class.java:1786)
```

The following is one example.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 4, failed 9, canceled 0, ignored 0, pending 0
*** 9 TESTS FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. **Scala 2.12**: Pass the Jenkins or GitHub Action with the existing tests.
2. **Scala 2.13**: Do the following manually at least.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29239 from dongjoon-hyun/SPARK-32441.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 20:34:31 -07:00
Dongjoon Hyun d1301af4eb [SPARK-32437][CORE][FOLLOWUP] Update dependency manifest for RoaringBitmap 0.9.0 2020-07-25 10:58:25 -07:00
HyukjinKwon 8e36a8f33f [SPARK-32419][PYTHON][BUILD] Avoid using subshell for Conda env (de)activation in pip packaging test
### What changes were proposed in this pull request?

This PR proposes to avoid using subshell when it activates Conda environment. Looks like it ends up with activating the env within the subshell even if you use `conda` command.

### Why are the changes needed?

If you take a close look for GitHub Actions log:

```
 Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Using legacy setup.py install for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
 Running setup.py install for pyspark: started
 Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0

...

Installing dist into virtual env
Obtaining file:///home/runner/work/spark/spark/python
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
 Attempting uninstall: py4j
 Found existing installation: py4j 0.10.9
 Uninstalling py4j-0.10.9:
 Successfully uninstalled py4j-0.10.9
 Attempting uninstall: pyspark
 Found existing installation: pyspark 3.1.0.dev0
 Uninstalling pyspark-3.1.0.dev0:
 Successfully uninstalled pyspark-3.1.0.dev0
 Running setup.py develop for pyspark
Successfully installed py4j-0.10.9 pyspark
```

It looks not properly using Conda as it removes the previously installed one when it reinstalls again.
We should ideally test it with Conda environment as it's intended.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions will test. I also manually tested in my local.

Closes #29212 from HyukjinKwon/SPARK-32419.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-25 13:09:23 +09:00
Gengliang Wang 8896f4af87 Revert "[SPARK-32253][INFRA] Show errors only for the sbt tests of github actions"
### What changes were proposed in this pull request?

This reverts commit 026b0b926d.

### Why are the changes needed?

As HyukjinKwon pointed out in https://github.com/apache/spark/pull/29133#issuecomment-663339240, there is no JUnit test report after https://github.com/apache/spark/pull/29133. Let's revert https://github.com/apache/spark/pull/29133 for now and find a better solution to improve the log output later.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build

Closes #29219 from gengliangwang/revertErrorOnly.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-07-24 18:14:19 +08:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though it looks large, it is 99% changes to the new location of scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
HyukjinKwon 4da93b00d7 [SPARK-32363][PYTHON][BUILD] Fix flakiness in pip package testing in Jenkins
### What changes were proposed in this pull request?

This PR proposes:

- Don't use `--user` in pip packaging test
- Pull `source` out of the subshell, and place it first.
- Exclude user sitepackages in Python path during pip installation test

to address the flakiness of the pip packaging test in Jenkins.

(I think) #29116 caused this flakiness given my observation in the Jenkins log. I had to work around by specifying `--user` but it turned out that it does not properly work in old Conda on Jenkins for some reasons. Therefore, reverting this change back.

(I think) the installation at user site-packages affects other environments created by Conda in the old Conda version that Jenkins has. Seems it fails to isolate the environments for some reasons. So, it excludes user sitepackages in the Python path during the test.

In addition, #29116 also added some fallback logics of `conda (de)activate` and `source (de)activate` because Conda prefers to use `conda (de)activate` now per the official documentation and `source (de)activate` doesn't work for some reasons in certain environments (see also https://github.com/conda/conda/issues/7980). The problem was that `source` loads things to the current shell so does not affect the current shell. Therefore, this PR pulls `source` out of the subshell.

Disclaimer: I made the analysis purely based on Jenkins machine's log in this PR. It may have a different reason I missed during my observation.

### Why are the changes needed?

To make the build and tests pass in Jenkins.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins tests should test it out.

Closes #29117 from HyukjinKwon/debug-conda.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-21 22:49:14 +09:00
Gengliang Wang 026b0b926d [SPARK-32253][INFRA] Show errors only for the sbt tests of github actions
### What changes were proposed in this pull request?

Make the test result log of github action more readable by showing errors from SBT only.
1. Add "--error" flag to sbt in github action to set the log level as "ERROR"
2. Show only failed test cases in stderr output of github action. According to https://www.scalatest.org/user_guide/using_the_runner, with SBT option `-eNCXEHLOPQMDF ` we can drop all the following events:
```
N - drop TestStarting events
C - drop TestSucceeded events
X - drop TestIgnored events
E - drop TestPending events
H - drop SuiteStarting events
L - drop SuiteCompleted events
O - drop InfoProvided events
P - drop ScopeOpened events
Q - drop ScopeClosed events
R - drop ScopePending events
M - drop MarkupProvided events
```
and enable the following two mode:
```
D - show all durations
F - show full stack traces
```

### Why are the changes needed?

Currently, the output of github action is very long and we have to scroll down to find the failed test cases. Even more, the log may be truncated. In such a case, we will have to wait until all the jobs are completed and then download all the raw logs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Before changes, all the warnings in compiling are shown:
![image](https://user-images.githubusercontent.com/1097932/87846810-98ec8900-c887-11ea-913b-164b84df62cd.png)

as well as all the passed and ignored test cases:
![image](https://user-images.githubusercontent.com/1097932/87846834-ca655480-c887-11ea-9c29-977f802e4c82.png)

After changes, sbt test only shows the summary for a successful job:
![image](https://user-images.githubusercontent.com/1097932/87846961-e74e5780-c888-11ea-82d5-cf1da1740181.png)

![image](https://user-images.githubusercontent.com/1097932/87745273-5735e280-c7a2-11ea-8ac9-b4b0e3cb458d.png)

If there is a test failure, a full stack track is shown as well as a test failure summary at the end of test log:

![image](https://user-images.githubusercontent.com/1097932/87751143-3aa1a680-c7b2-11ea-9d09-52637a322270.png)

![image](https://user-images.githubusercontent.com/1097932/87752404-1f846600-c7b5-11ea-8106-8ddaf3cc3f7e.png)

Closes #29133 from gengliangwang/shortLog.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-19 12:00:23 +09:00
Sean Owen 40ef01283d [SPARK-29802][BUILD] Use python3 in build scripts
### What changes were proposed in this pull request?

Use `/usr/bin/env python3` consistently instead of `/usr/bin/env python` in build scripts, to reliably select Python 3.

### Why are the changes needed?

Scripts no longer work with Python 2.

### Does this PR introduce _any_ user-facing change?

No, should be all build system changes.

### How was this patch tested?

Existing tests / NA

Closes #29151 from srowen/SPARK-29909.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-19 11:02:37 +09:00
williamhyun 5daf244d0f [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
### What changes were proposed in this pull request?

This PR aims to rename `HADOOP2_MODULE_PROFILES` to `HADOOP_MODULE_PROFILES` because Hadoop 3 is now the default.

### Why are the changes needed?

Hadoop 3 is now the default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GitHub Action dependency test.

Closes #29128 from williamhyun/williamhyun-patch-3.

Authored-by: williamhyun <62487364+williamhyun@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-17 11:59:19 -05:00
HyukjinKwon ea9e8f365a [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)

### Why are the changes needed?

There are many bug fixes. For example, the bug described in the JIRA:

dill unpickling fails because they define `types.ClassType`, which is undefined in dill. This results in the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```

See also https://github.com/cloudpipe/cloudpickle/issues/82. This was fixed for cloudpickle 1.3.0+ (https://github.com/cloudpipe/cloudpickle/pull/337), but PySpark's cloudpickle.py doesn't have this change yet.

More notably, now it supports C pickle implementation with Python 3.8 which hugely improve performance. This is already adopted in another project such as Ray.

### Does this PR introduce _any_ user-facing change?

Yes, as described above, the bug fixes. Internally, users also could leverage the fast cloudpickle backed by C pickle.

### How was this patch tested?

Jenkins will test it out.

Closes #29114 from HyukjinKwon/SPARK-32094.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 11:49:18 +09:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
HyukjinKwon 902e1342a3 [SPARK-32303][PYTHON][TESTS] Remove leftover from editable mode installation in PIP test
### What changes were proposed in this pull request?

Currently the Jenkins PIP packaging test fails as below intermediately:

```
Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9 (from pyspark==3.1.0.dev0)
  Downloading 6a4fb90cd2/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Installing collected packages: py4j, pyspark
  Found existing installation: py4j 0.10.9
    Uninstalling py4j-0.10.9:
      Successfully uninstalled py4j-0.10.9
  Found existing installation: pyspark 3.1.0.dev0
Exception:
Traceback (most recent call last):
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
    status = self.run(options, args)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
    use_user_site=options.use_user_site,
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 50, in install_given_reqs
    auto_confirm=True
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
    uninstalled_pathset = UninstallPathSet.from_dist(dist)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
    '(at %s)' % (link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link /home/jenkins/workspace/SparkPullRequestBuilder3/python does not match installed
```

- https://github.com/apache/spark/pull/29099#issuecomment-658073453 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29090#issuecomment-657819973 (amp-jenkins-worker-03)

Seems like the previous installation of editable mode affects other PRs.

This PR simply works around by removing the symbolic link from the previous editable installation. This is a common workaround up to my knowledge.

### Why are the changes needed?

To recover the Jenkins build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins build will test it out.

Closes #29102 from HyukjinKwon/SPARK-32303.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 16:43:16 -07:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Hyukjin Kwon 27ef3629dd [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only relevant tests just like Jenkins PR builder does. Currently, GitHub Actions always run full tests which wastes the resources.

In addition, this PR also fixes 3 more issues  very closely related together while I am here.

1. The main idea here is: It reuses the existing logic embedded in `dev/run-tests.py` which Jenkins PR builder use in order to run only the related test cases.

2. While I am here, I fixed SPARK-32292 too to run the doc tests. It was because other references were not available when it is cloned via `checkoutv2`. With `fetch-depth: 0`, the history is available.

3. In addition, it fixes the `dev/run-tests.py` to match with `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were moved as proper options. For example,

    ```bash
    dev/run-tests.py --modules sql,core
    ```

    which is consistent with `python/run-tests.py`, for example,

    ```bash
    python/run-tests.py --modules pyspark-core,pyspark-ml
    ```

4. Lastly, also fixed the formatting issue in module specification in the matrix:

    ```diff
    -            network_common, network_shuffle, repl, launcher
    +            network-common, network-shuffle, repl, launcher,
    ```

    which incorrectly runs build/test the modules.

### Why are the changes needed?

By running only related tests, we can hugely save the resources and avoid unrelated flaky tests, etc.
Also, now it runs the doctest of `dev/run-tests.py` properly, the usages are similar between `dev/run-tests.py` and `python/run-tests.py`, and run `network-common`, `network-shuffle`, `launcher` and `examples` modules too.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark:

https://github.com/HyukjinKwon/spark/pull/7
https://github.com/HyukjinKwon/spark/pull/8
https://github.com/HyukjinKwon/spark/pull/9
https://github.com/HyukjinKwon/spark/pull/10
https://github.com/HyukjinKwon/spark/pull/11
https://github.com/HyukjinKwon/spark/pull/12

Closes #29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:31:39 -07:00
HyukjinKwon b84ed4146d [SPARK-32245][INFRA] Run Spark tests in Github Actions
### What changes were proposed in this pull request?

This PR aims to run the Spark tests in Github Actions.

To briefly explain the main idea:

- Reuse `dev/run-tests.py` with SBT build
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module
- Pass the modules to test into `dev/run-tests.py` directly via `TEST_ONLY_MODULES` environment variable. For example, `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` _does not_ take the dependent modules into account but solely the specified modules to test.

Another thing to note might be `SlowHiveTest` annotation. Running the tests in Hive modules takes too much so the slow tests are extracted and it runs as a separate job. It was extracted from the actual elapsed time in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So, Hive tests are separated into to jobs. One is slow test cases, and the other one is the other test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

### Why are the changes needed?

Last week and onwards, the Jenkins machines became very unstable for many reasons:
  - Apparently, the machines became extremely slow. Almost all tests can't pass.
  - One machine (worker 4) started to have the corrupt `.m2` which fails the build.
  - Documentation build fails time to time for an unknown reason in Jenkins machine specifically. This is disabled for now at https://github.com/apache/spark/pull/29017.
  - Almost all PRs are basically blocked by this instability currently.

The advantages of using Github Actions:
  - To avoid depending on few persons who can access to the cluster.
  - To reduce the elapsed time in the build - we could split the tests (e.g., SQL, ML, CORE), and run them in parallel so the total build time will significantly reduce.
  - To control the environment more flexibly.
  - Other contributors can test and propose to fix Github Actions configurations so we can distribute this build management cost.

Note that:
- The current build in Jenkins takes _more than 7 hours_. With Github actions it takes _less than 2 hours_
- We can now control the environments especially for Python easily.
- The test and build look more stable than the Jenkins'.

### Does this PR introduce _any_ user-facing change?

No, dev-only change.

### How was this patch tested?

Tested at https://github.com/HyukjinKwon/spark/pull/4

Closes #29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 13:09:06 -07:00
William Hyun 10a65ee9b4 [SPARK-32150][BUILD] Upgrade to ZStd 1.4.5-4
### What changes were proposed in this pull request?

This PR aims to upgrade to ZStd 1.4.5-4.

### Why are the changes needed?

ZStd 1.4.5-4 fixes the following.

- 3d16e51525
- 3d51bdcb82
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #28969 from williamhyun/zstd2.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-12 00:45:48 +09:00
Dongjoon Hyun c5bd0730a2 [SPARK-32231][R][INFRA] Use Hadoop 3.2 winutils in AppVeyor build
### What changes were proposed in this pull request?

This PR proposes to use Hadoop 3 winutils to make AppVeyor builds pass. Currently it's being failed as below
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/33976604

### Why are the changes needed?

To recover the build in AppVeyor.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

AppVeyor build will test it out.

Closes #29042 from HyukjinKwon/SPARK-32231.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-09 17:18:39 +09:00
Dongjoon Hyun 17997a5796 [SPARK-32233][TESTS] Disable SBT unidoc generation testing in Jenkins
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in Jenkins environment because it's flaky in Jenkins environment and not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation.

- https://github.com/apache/spark/pull/28848#issuecomment-654577911 (amp-jenkins-worker-06)
- https://github.com/apache/spark/pull/28926#issuecomment-654316537 (amp-jenkins-worker-06)
- https://github.com/apache/spark/pull/28969#issuecomment-654918636 (amp-jenkins-worker-06)
- https://github.com/apache/spark/pull/28975#issuecomment-654447955 (amp-jenkins-worker-05)
- https://github.com/apache/spark/pull/28986#issuecomment-654416543  (amp-jenkins-worker-05)
- https://github.com/apache/spark/pull/28992#issuecomment-654371469 (amp-jenkins-worker-06)
- https://github.com/apache/spark/pull/28993#issuecomment-655289237 (amp-jenkins-worker-05)
- https://github.com/apache/spark/pull/28999#issuecomment-653976760 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29010#issuecomment-655246083 (amp-jenkins-worker-03)
- https://github.com/apache/spark/pull/29013#issuecomment-654292483 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29016#issuecomment-654495070 (amp-jenkins-worker-05)
- https://github.com/apache/spark/pull/29025#issuecomment-654889938 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29042#issuecomment-655587989 (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-08 14:11:18 -07:00
Dongjoon Hyun 0e33b5ecde [SPARK-32178][TESTS] Disable test-dependencies.sh from Jenkins jobs
### What changes were proposed in this pull request?

This PR aims to disable dependency tests(test-dependencies.sh) from Jenkins.

### Why are the changes needed?

- First of all, GitHub Action provides the same test capability already and stabler.
- Second, currently, `test-dependencies.sh` fails very frequently in AmpLab Jenkins environment. For example, in the following irrelevant PR, it fails 5 times during 6 hours.
   - https://github.com/apache/spark/pull/29001

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins without `test-dependencies.sh` invocation.

Closes #29004 from dongjoon-hyun/SPARK-32178.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-06 12:03:08 +09:00
Wenchen Fan a6b6a1fd61 [MINOR] update dev/create-release/known_translations
This is a followup of https://github.com/apache/spark/pull/28861 :
1. sort the names by Github ID, suggested by https://github.com/apache/spark/pull/28861#pullrequestreview-433369190
2. add more full names collected in https://github.com/apache/spark/pull/28861
3. The duplicated entry of `linbojin` is removed.
4. Name format is normalized to First Last name style.

Closes #28891 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-29 11:53:56 +00:00
Dongjoon Hyun 9c134b57bf [SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default
### What changes were proposed in this pull request?

According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 on December 2020.

| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In Apache Spark 3.0.0 release, we focused on the other features. This PR targets for [Apache Spark 3.1.0 scheduled on December 2020](https://spark.apache.org/versioning-policy.html).

### Why are the changes needed?

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

### Does this PR introduce _any_ user-facing change?

Since the default Hadoop dependency changes, the users will get a better support in a cloud environment.

### How was this patch tested?

Pass the Jenkins.

Closes #28897 from dongjoon-hyun/SPARK-32058.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-26 19:43:29 -07:00
HyukjinKwon 71b6d462fb [SPARK-32089][R][BUILD] Upgrade R version to 4.0.2 in the release DockerFiile
### What changes were proposed in this pull request?

This PR proposes to upgrade R version to 4.0.2 in the release docker image. As of SPARK-31918, we should make a release with R 4.0.0+ which works with R 3.5+ too.

### Why are the changes needed?

To unblock releases on CRAN.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested via scripts under `dev/create-release`, manually attaching to the container and checking the R version.

Closes #28922 from HyukjinKwon/SPARK-32089.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-24 14:56:55 +00:00
HyukjinKwon e29ec42879 [SPARK-32074][BUILD][R] Update AppVeyor R version to 4.0.2
### What changes were proposed in this pull request?
R version 4.0.2 was released, see https://cran.r-project.org/doc/manuals/r-release/NEWS.html. This PR targets to upgrade R version in AppVeyor CI environment.

### Why are the changes needed?

To test the latest R versions before the release, and see if there are any regressions.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

AppVeyor will test.

Closes #28909 from HyukjinKwon/SPARK-32074.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-24 15:37:41 +09:00
William Hyun 2bcbe3dd9a [SPARK-32045][BUILD] Upgrade to Apache Commons Lang 3.10
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache Commons Lang 3.10.

### Why are the changes needed?

This will bring the latest bug fixes like [LANG-1453](https://issues.apache.org/jira/browse/LANG-1453).
https://commons.apache.org/proper/commons-lang/release-notes/RELEASE-NOTES-3.10.txt

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins.

Closes #28889 from williamhyun/commons-lang-3.10.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-23 16:59:55 +09:00
Wenchen Fan 8a9ae01e74 [MINOR] update dev/create-release/known_translations
This is auto-updated by running script `translate-contributors.py`

Closes #28861 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-18 15:32:44 +00:00
DB Tsai 9b792518b2 [SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build
### What changes were proposed in this pull request?
If a Spark distribution has built-in hadoop runtime, Spark will not populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to Yarn. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`.

### Why are the changes needed?
Without this, Spark will populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even Spark distribution has built-in hadoop. This results jar conflict and many unexpected behaviors in runtime.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually test with two builds, with-hadoop and no-hadoop builds.

Closes #28788 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-06-18 06:08:40 +00:00
Gengliang Wang f535004e14 [SPARK-31967][UI] Downgrade to vis.js 4.21.0 to fix Jobs UI loading time regression
### What changes were proposed in this pull request?

After #28192, the job list page becomes very slow.
For example, after the following operation, the UI loading can take >40 sec.
```
(1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect)
```

This is caused by a  [performance issue of `vis-timeline`](https://github.com/visjs/vis-timeline/issues/379). The serious issue affects both branch-3.0 and branch-2.4

I tried a different version 4.21.0 from https://cdnjs.com/libraries/vis
The infinite drawing issue seems also fixed if the zoom is disabled as default.

### Why are the changes needed?

Fix the serious perf issue in web UI by falling back vis-timeline-graph2d to an ealier version.

### Does this PR introduce _any_ user-facing change?

Yes, fix the UI perf regression

### How was this patch tested?

Manual test

Closes #28806 from gengliangwang/downgradeVis.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-12 17:22:41 -07:00
Wenchen Fan 28f131fc8a [SPARK-31979] Release script should not fail when remove non-existing files
### What changes were proposed in this pull request?

When removing non-existing files in the release script, do not fail.

### Why are the changes needed?

This is to make the release script more robust, as we don't care if the files exist before we remove them.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested when cutting 3.0.0 RC

Closes #28815 from cloud-fan/release.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-12 11:06:52 -07:00
Wenchen Fan d3a5e2963c Revert "[SPARK-31860][BUILD] only push release tags on succes"
This reverts commit 69ba9b662e.
2020-06-12 17:50:43 +08:00
William Hyun 367d94a30d [SPARK-31876][BUILD] Upgrade to Zstd 1.4.5
### What changes were proposed in this pull request?

This PR aims to upgrade to Zstd 1.4.5.

### Why are the changes needed?

Zstd 1.4.5 improves performance.

https://github.com/facebook/zstd/releases/tag/v1.4.5

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed the Jenkins.

Closes #28682 from williamhyun/zstd.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-06-02 21:58:33 +00:00
Holden Karau 69ba9b662e [SPARK-31860][BUILD] only push release tags on succes
### What changes were proposed in this pull request?

Only push the release tag after the build has finished.

### Why are the changes needed?

If the build fails we don't need a release tag.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Running locally with a fake user in https://github.com/apache/spark/pull/28667

Closes #28700 from holdenk/SPARK-31860-build-master-only-push-tags-on-success.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-06-02 11:04:07 -07:00
Holden Karau ab9e5a2fe9 [SPARK-31889][BUILD] Docker release script does not allocate enough memory to reliably publish
### What changes were proposed in this pull request?
Allow overriding the zinc options in the docker release and set a higher so the publish step can succeed consistently.

### Why are the changes needed?

The publish step experiences memory pressure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Running test locally with fake user to see if publish step (besides svn part) succeeds

Closes #28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-06-01 15:49:17 -07:00
Dongjoon Hyun 625abca9db [SPARK-31858][BUILD] Upgrade commons-io to 2.5 in Hadoop 3.2 profile
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-io` from 2.4 to 2.5 for Apache Spark 3.1.

### Why are the changes needed?

Since Hadoop 3.1, `commons-io` 2.5 is used.
- https://issues.apache.org/jira/browse/HADOOP-15261

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with Hadoop-3.2 profile.

Maven dependency is verified via `test-dependencies.sh` automatically. SBT dependency can be verified like the following manually.
```
build/sbt -Phadoop-3.2 "core/dependencyTree" | grep commons-io:commons-io | head -n1
[info]   | | +-commons-io:commons-io:2.5
```

Closes #28665 from dongjoon-hyun/SPARK-31858.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-29 07:46:53 -07:00
Jungtaek Lim (HeartSaVioR) fe1d1e24bc [SPARK-31214][BUILD] Upgrade Janino to 3.1.2
### What changes were proposed in this pull request?

This PR proposes to upgrade Janino to 3.1.2 which is released recently.

Major changes were done for refactoring, as well as there're lots of "no commit message". Belows are the pairs of (commit title, commit) which seem to deal with some bugs or specific improvements (not purposed to refactor) after 3.0.15.

* Issue #119: Guarantee executing popOperand() in popUninitializedVariableOperand() via moving popOperand() out of "assert"
* Issue #116: replace operand to final target type if boxing conversion and widening reference conversion happen together
* Merged pull request `#114` "Grow the code for relocatables, and do fixup, and relocate".
  * 367c58e73e
* issue `#107`: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
  * f7d99596d4
* Throw an NYI CompileException when a static interface method is invoked.
  * efd3884983
* Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions)
  * 32fdb5f5f1
* Issue `#104`: ClassLoaderIClassLoader 's ClassNotFoundException handle mechanism enhancement
  * 6e8a97d609

You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html

### Why are the changes needed?

We got some report on failure on user's query which Janino throws error on compiling generated code. The issue is here: https://github.com/janino-compiler/janino/issues/113 It contains the information of generated code, symptom (error), and analysis of the bug, so please refer the link for more details.
Janino 3.1.1 contains the PR https://github.com/janino-compiler/janino/pull/114 which would enable Janino to succeed to compile user's query properly. I've also fixed a couple of more bugs as 3.1.1 made Spark UTs fail - hence we need to upgrade to 3.1.2.

Furthermore, from my testing, https://github.com/janino-compiler/janino/issues/90 (which Josh Rosen filed before) seems to be also resolved in 3.1.2 as well.

Looks like Janino is maintained by one person and there's no even version branches and releases/tags so we can't expect Janino maintainer to release a new bugfix version - hence have to try out new minor version.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #27860 from HeartSaVioR/SPARK-31101.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-29 07:42:57 -07:00
shane knapp 9e68affd13 [BUILD][INFRA] bump the timeout to match the jenkins PRB
### What changes were proposed in this pull request?

bump the timeout to match what's set in jenkins

### Why are the changes needed?

tests be timing out!

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

via jenkins

Closes #28666 from shaneknapp/increase-jenkins-timeout.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2020-05-28 14:25:49 -07:00
Gabor Somogyi 8f1d77488c
[SPARK-31821][BUILD] Remove mssql-jdbc dependencies from Hadoop 3.2 profile
### What changes were proposed in this pull request?
There is an unnecessary dependency for `mssql-jdbc`. In this PR I've removed it.

### Why are the changes needed?
Unnecessary dependency.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Pass the Jenkins with the following configuration.
- [x] Pass the dependency test.
- [x] SBT with Hadoop-3.2 (https://github.com/apache/spark/pull/28640#issuecomment-634192512)
- [ ] Maven with Hadoop-3.2

Closes #28640 from gaborgsomogyi/SPARK-31821.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-26 18:21:22 -07:00
William Hyun 753636e86b [SPARK-31807][INFRA] Use python 3 style in release-build.sh
### What changes were proposed in this pull request?

This PR aims to use the style that is compatible with both python 2 and 3.

### Why are the changes needed?

This will help python 3 migration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #28632 from williamhyun/use_python3_style.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-25 10:25:43 +09:00
Kousuke Saruta 8441e936fc Revert "[SPARK-31756][WEBUI] Add real headless browser support for UI test
This reverts commit d95570864a.

Closes #28624 from sarutak/revert-real-headless-browser-support.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-05-24 09:13:38 +09:00
Dongjoon Hyun 64ffc66496
[SPARK-31786][K8S][BUILD] Upgrade kubernetes-client to 4.9.2
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` library to bring the JDK8 related fixes. Please note that JDK11 works fine without any problem.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.9.2
  - JDK8 always uses http/1.1 protocol (Prevent OkHttp from wrongly enabling http/2)

### Why are the changes needed?

OkHttp "wrongly" detects the Platform as Jdk9Platform on JDK 8u251.
- https://github.com/fabric8io/kubernetes-client/issues/2212
- https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster

Although there is a workaround `export HTTP2_DISABLE=true` and `Downgrade JDK or K8s`, we had better avoid this problematic situation.

### Does this PR introduce _any_ user-facing change?

No. This will recover the failures on JDK 8u252.

### How was this patch tested?

- [x] Pass the Jenkins UT (https://github.com/apache/spark/pull/28601#issuecomment-632474270)
- [x] Pass the Jenkins K8S IT with the K8s 1.13 (https://github.com/apache/spark/pull/28601#issuecomment-632438452)
- [x] Manual testing with K8s 1.17.3. (Below)

**v1.17.6 result (on Minikube)**
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 27 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #28601 from dongjoon-hyun/SPARK-K8S-CLIENT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-23 11:07:45 -07:00
Kousuke Saruta d95570864a [SPARK-31756][WEBUI] Add real headless browser support for UI test
### What changes were proposed in this pull request?

This PR mainly adds two things.

1. Real headless browser support for UI test
2. A test suite using headless Chrome as one instance of  those browsers.

Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite.

### Why are the changes needed?

In the current master, there are two problems for UI test.
1. Lots of tests especially JavaScript related ones are done manually.
Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally.

2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough.
I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test.
The test I added works somehow but some JavaScript related error is shown in unit-tests.log.

```
======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904)
        at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807)
        at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426)
        at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157)
        at java.lang.Thread.run(Thread.java:748)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
        at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
        at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889)
        ... 10 more
JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)".
== CALLING JAVASCRIPT ==
  function () {
      throw e;
  }
======= EXCEPTION END ========
```
I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error.
```
[info] UISeleniumSuite:
[info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1)
```
To resolve those problems, it's better to support headless browser for UI test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I tested with following patterns. Both Chrome and Chrome driver should be installed to test.

1. sbt / with chromedriver / include tag (expect to succeed)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
2. sbt / with chromedriver / exclude tag (expect to be ignored)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -l org.apache.spark.tags.ChromeUITest"`
3. sbt / without chromedriver / include tag (expect to be failed)
`build/sbt  "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
4. sbt / without chromedriver / exclude tag (expect to be skipped)
`build/sbt  -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
5. Maven / wth chromedriver / include tag (expect to succeed)
`build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
6. Maven / with chromedriver / exclude tag (expect to be skipped)
`build/mvn -Dtest.exclude.tags="org.apache.spark.tags.ChromeUITest" -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
7. Maven / without chromedriver / include tag (expect to be failed)
`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
8. Maven / without chromedriver / exclude tag (expect to be skipped)
`build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest  -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`

Closes #28578 from sarutak/real-headless-browser-support.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-22 08:24:31 -05:00
Thomas Graves b64688ebba [SPARK-29303][WEB UI] Add UI support for stage level scheduling
### What changes were proposed in this pull request?

This adds UI updates to support stage level scheduling and ResourceProfiles. 3 main things have been added. ResourceProfile id added to the Stage page, the Executors page now has an optional selectable column to show the ResourceProfile Id of each executor, and the Environment page now has a section with the ResourceProfile ids.  Along with this the rest api for environment page was updated to include the Resource profile information.

I debating on splitting the resource profile information into its own page but I wasn't sure it called for a completely separate page. Open to peoples thoughts on this.

Screen shots:
![Screen Shot 2020-04-01 at 3 07 46 PM](https://user-images.githubusercontent.com/4563792/78185169-469a7000-7430-11ea-8b0c-d9ede2d41df8.png)
![Screen Shot 2020-04-01 at 3 08 14 PM](https://user-images.githubusercontent.com/4563792/78185175-48fcca00-7430-11ea-8d1d-6b9333700f32.png)
![Screen Shot 2020-04-01 at 3 09 03 PM](https://user-images.githubusercontent.com/4563792/78185176-4a2df700-7430-11ea-92d9-73c382bb0f32.png)
![Screen Shot 2020-04-01 at 11 05 48 AM](https://user-images.githubusercontent.com/4563792/78185186-4dc17e00-7430-11ea-8962-f749dd47ea60.png)

### Why are the changes needed?

For user to be able to know what resource profile was used with which stage and executors. The resource profile information is also available so user debugging can see exactly what resources were requested with that profile.

### Does this PR introduce any user-facing change?

Yes, UI updates.

### How was this patch tested?

Unit tests and tested on yarn both active applications and with the history server.

Closes #28094 from tgravescs/SPARK-29303-pr.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-21 13:11:35 -05:00
Prakhar Jain c560428fe0 [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
### What changes were proposed in this pull request?
After changes in SPARK-20628, CoarseGrainedSchedulerBackend can decommission an executor and stop assigning new tasks on it. We should also decommission the corresponding blockmanagers in the same way. i.e. Move the cached RDD blocks from those executors to other active executors.

### Why are the changes needed?
We need to gracefully decommission the block managers so that the underlying RDD cache blocks are not lost in case the executors are taken away forcefully after some timeout (because of spotloss/pre-emptible VM etc). Its good to save as much cache data as possible.

Also In future once the decommissioning signal comes from Cluster Manager (say YARN/Mesos etc), dynamic allocation + this change gives us opportunity to downscale the executors faster by making the executors free of cache data.

Note that this is a best effort approach. We try to move cache blocks from decommissioning executors to active executors. If the active executors don't have free resources available on them for caching, then the decommissioning executors will keep the cache block which it was not able to move and it will still be able to serve them.

Current overall Flow:

1. CoarseGrainedSchedulerBackend receives a signal to decommissionExecutor. On receiving the signal, it do 2 things - Stop assigning new tasks (SPARK-20628), Send another message to BlockManagerMasterEndpoint (via BlockManagerMaster) to decommission the corresponding BlockManager.

2. BlockManagerMasterEndpoint receives "DecommissionBlockManagers" message. On receiving this, it moves the corresponding block managers to "decommissioning" state. All decommissioning BMs are excluded from the getPeers RPC call which is used for replication. All these decommissioning BMs are also sent message from BlockManagerMasterEndpoint to start decommissioning process on themselves.

3. BlockManager on worker (say BM-x) receives the "DecommissionBlockManager" message. Now it will start BlockManagerDecommissionManager thread to offload all the RDD cached blocks. This thread can make multiple reattempts to decommission the existing cache blocks (multiple reattempts might be needed as there might not be sufficient space in other active BMs initially).

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
Added UTs.

Closes #28370 from prakharjain09/SPARK-20732-rddcache-1.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-05-18 11:37:53 -07:00
Dongjoon Hyun cd5fbcf9a0
[SPARK-31713][INFRA] Make test-dependencies.sh detect version string correctly
### What changes were proposed in this pull request?

This PR makes `test-dependencies.sh` detect the version string correctly by ignoring all the other lines.

### Why are the changes needed?

Currently, all SBT jobs are broken like the following.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/476/console
```
[error] running /home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/dev/test-dependencies.sh ; received return code 1
Build step 'Execute shell' marked build as failure
```

The reason is that the script detects the old version like `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` when `build/mvn` did fallback.

Specifically, in the script, `OLD_VERSION` became `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` instead of `3.1.0-SNAPSHOT` if build/mvn did fallback. Then, `pom.xml` file is corrupted like the following at the end and the exit code become `1` instead of `0`. It causes Jenkins jobs fails
```
-    <version>3.1.0-SNAPSHOT</version>
+    <version>Falling</version>
```

**NO FALLBACK**
```
$ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
3.1.0-SNAPSHOT
```

**FALLBACK**
```
$ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec
Falling back to archive.apache.org to download Maven
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
3.1.0-SNAPSHOT
```

**In the script**
```
$ echo $(build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec)
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT
```

This PR will prevent irrelevant logs like `Falling back to archive.apache.org to download Maven`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the PR Builder.

Closes #28532 from dongjoon-hyun/SPARK-31713.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-14 19:28:25 -07:00
Kousuke Saruta 38bc45b0b5
[SPARK-30654][WEBUI][FOLLOWUP] Remove bootstrap-tooltip.js which is no longer used
### What changes were proposed in this pull request?

This PR removes `bootstrap-tooltip.js` which is no longer used.
That script is replaced with `bootstrap.bundle.min.js` in SPARK-30654 ( #27370 ).

### Why are the changes needed?

For cleaning up repository..

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked whether tooltips are shown in the UI and no error message shown in the debug console.

Closes #28515 from sarutak/remove-tooltipjs.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 01:04:57 -07:00
Dongjoon Hyun 3772154442
[SPARK-31691][INFRA] release-build.sh should ignore a fallback output from build/mvn
### What changes were proposed in this pull request?

This PR adds `i` option to ignore additional `build/mvn` output which is irrelevant to version string.

### Why are the changes needed?

SPARK-28963 added additional output message, `Falling back to archive.apache.org to download Maven` in build/mvn. This breaks `dev/create-release/release-build.sh` and currently Spark Packaging Jenkins job is hitting this issue consistently and broken.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This happens only when the mirror fails. So, this is verified manually hiject the script. It works like the following.
```
$ echo 'Falling back to archive.apache.org to download Maven' > out
$ build/mvn help:evaluate -Dexpression=project.version >> out
Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
$ cat out | grep -v INFO | grep -v WARNING | grep -v Download
Falling back to archive.apache.org to download Maven
3.1.0-SNAPSHOT
$ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
3.1.0-SNAPSHOT
```

Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-12 14:24:56 -07:00
Dongjoon Hyun 0feb3cbe77
[SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
### What changes were proposed in this pull request?

This PR aims to use GitHub urls instead of GitHub in the release scripts.

### Why are the changes needed?

Currently, Spark Packaing jobs are broken due to GitBox issue.
```
fatal: unable to access 'https://gitbox.apache.org/repos/asf/spark.git/': Failed to connect to gitbox.apache.org port 443: Connection timed out
```

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2906/console
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-3.0-maven-snapshots/105/console
 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.4-maven-snapshots/439/console

### Does this PR introduce _any_ user-facing change?

No. (This is a dev-only script.)

### How was this patch tested?

Manual.
```
$ cat ./test.sh
. dev/create-release/release-util.sh
get_release_info
git clone "$ASF_REPO"

$ sh test.sh
Branch [branch-3.0]:
Current branch version is 3.0.1-SNAPSHOT.
Release [3.0.0]:
RC # [2]:
Full name [Dongjoon Hyun]:
GPG key [dongjoonapache.org]:
================
Release details:
BRANCH:     branch-3.0
VERSION:    3.0.0
TAG:        v3.0.0-rc2
NEXT:       3.0.1-SNAPSHOT

ASF USER:   dongjoon
GPG KEY:    dongjoonapache.org
FULL NAME:  Dongjoon Hyun
E-MAIL:     dongjoonapache.org
================
Is this info correct [y/n]? y
ASF password:
GPG passphrase:
Cloning into 'spark'...
remote: Enumerating objects: 223, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 708324 (delta 70), reused 138 (delta 32), pack-reused 708101
Receiving objects: 100% (708324/708324), 322.08 MiB | 2.94 MiB/s, done.
Resolving deltas: 100% (289268/289268), done.
Updating files: 100% (16287/16287), done.

$ sh test.sh

...
```

Closes #28513 from dongjoon-hyun/SPARK-31687.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-12 13:07:00 -07:00