Commit graph

661 commits

Author SHA1 Message Date
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way, as the metadata is ignored by .md renderers and other tools. Attempts to write the headers as Markdown comments revealed that there is no standard way to do so.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Sean Owen 2ec650d843 [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data
## What changes were proposed in this pull request?

(See JIRA for problem statement)

Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix.

There appear to be no other changes of consequence:
https://github.com/xerial/snappy-java/blob/master/Milestone.md

## How was this patch tested?

Existing tests

Closes #24242 from srowen/SPARK-27267.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:41:24 -05:00
Hyukjin Kwon 0e16a6f5b0 [SPARK-27277][INFRA] Recover from setting fix version failure in merge script
## What changes were proposed in this pull request?

I have hit this case a few times before:

```
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Restoring head pointer to master
git checkout master
Already on 'master'
git branch
Traceback (most recent call last):
  File "./dev/merge_spark_pr_jira.py", line 537, in <module>
    main()
  File "./dev/merge_spark_pr_jira.py", line 523, in main
    resolve_jira_issues(title, merged_refs, jira_comment)
  File "./dev/merge_spark_pr_jira.py", line 359, in resolve_jira_issues
    resolve_jira_issue(merge_branches, comment, jira_id)
  File "./dev/merge_spark_pr_jira.py", line 302, in resolve_jira_issue
    jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
  File "./dev/merge_spark_pr_jira.py", line 302, in <lambda>
    jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
  File "./dev/merge_spark_pr_jira.py", line 300, in get_version_json
    return filter(lambda v: v.name == version_str, versions)[0].raw
IndexError: list index out of range
```

I typed the fix version wrongly (there's a comma in `3.0,0`) and it ended the loop in the merge script. Not a big deal, but it has bugged me a few times. I hit it again today and finally decided to fix it.

This PR proposes to recover from wrongly set fix versions.
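
The core of the change is a retry loop around the fix-version prompt. A minimal sketch of the idea (function and variable names here are hypothetical; the actual change lives in Spark's merge script):

```python
# Sketch only: re-prompt until all entered fix versions exist in JIRA,
# treating a blank answer as "use the default" and any unexpected error
# as something to retry rather than abort on.
def ask_fix_versions(default_fix_versions, available_versions):
    while True:
        try:
            raw = input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
            raw = raw.strip() or default_fix_versions
            fix_versions = [v.strip() for v in raw.split(",")]
            missing = [v for v in fix_versions if v not in available_versions]
            if missing:
                print("Specified version(s) [%s] not found in the available versions, "
                      "try again (or leave blank and fix manually)." % ", ".join(missing))
                continue
            return fix_versions
        except KeyboardInterrupt:
            raise  # Ctrl + C should still abort the script.
        except Exception:
            print("Error setting fix version(s), try again (or leave blank and fix manually)")
```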

## How was this patch tested?

I manually copied and pasted the relevant code and tested it separately in both Python 2 and Python 3.

**Positive cases:**

```
Enter comma-separated fix version(s) [3.0.0]:  # blank test (to use default)
['3.0.0']
```

```
Enter comma-separated fix version(s) [3.0.0,2.4.2]:  # multiple default versions
['3.0.0', '2.4.2']
```

```
Enter comma-separated fix version(s) [3.0.0]: 2.4.1  # valid version
['2.4.1']
```

```
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.4.2  # multiple valid versions
['3.0.0', '2.4.2']
```

**Keyboard interrupt (Ctrl + C):**

```
Enter comma-separated fix version(s) [3.0.0]: ^CTraceback (most recent call last):  # keyboard interrupt
  File "test_merge_script.py", line 45, in <module>
    test()
  File "test_merge_script.py", line 26, in test
    fix_versions = input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
KeyboardInterrupt
```

**Wrongly typed versions (recovered):**

```
Enter comma-separated fix version(s) [3.0.0]: 3.1
Specified version(s) [3.1] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 123
Specified version(s) [123] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Specified version(s) [3.0, 0] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: damn
Specified version(s) [damn] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.5.2  # one invalid version among multiple versions
Specified version(s) [3.0.0, 2.5.2] not found in the available versions, try again (or leave blank and fix manually).
```

**Arbitrary exceptions in fix version parsing (recovered)**

```
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
  File "tmp.py", line 11, in <module>
    raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
  File "tmp.py", line 10, in <module>
    raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
```

Closes #24213 from HyukjinKwon/merge_script_fix_version.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-26 21:14:07 +09:00
Sean Owen 8bc304f97e [SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0
## What changes were proposed in this pull request?

Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below.

## How was this patch tested?

Existing tests.

Closes #23098 from srowen/SPARK-26132.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-25 10:46:42 -05:00
Yuming Wang 9c0af746e5 [SPARK-27175][BUILD] Upgrade hadoop-3 to 3.2.0
## What changes were proposed in this pull request?

This PR upgrades `hadoop-3` to `3.2.0` to work around [HADOOP-16086](https://issues.apache.org/jira/browse/HADOOP-16086). Otherwise, some test cases throw the following exception:
```java
02:44:34.707 ERROR org.apache.hadoop.hive.ql.exec.Task: Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
	at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:116)
	at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:109)
	at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:102)
	at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
	at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
	at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:369)
	at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:730)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
	at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:719)
	at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:709)
	at org.apache.spark.sql.hive.StatisticsSuite.createNonPartitionedTable(StatisticsSuite.scala:719)
	at org.apache.spark.sql.hive.StatisticsSuite.$anonfun$testAlterTableProperties$2(StatisticsSuite.scala:822)
```

## How was this patch tested?

manual tests

Closes #24106 from wangyum/SPARK-27175.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-16 19:42:05 -05:00
Dongjoon Hyun f26a1f3d37 [SPARK-27165][SPARK-27107][BUILD][SQL] Upgrade Apache ORC to 1.5.5
## What changes were proposed in this pull request?

This PR aims to update the Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107).
```
[ORC-452] Support converting MAP column from JSON to ORC Improvement
[ORC-447] Change the docker scripts to keep a persistent m2 cache
[ORC-463] Add `version` command
[ORC-475] ORC reader should lazily get filesystem
[ORC-476] Make SearchArgument kryo buffer size configurable
```

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #24096 from dongjoon-hyun/SPARK-27165.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-14 20:14:31 -07:00
Yuming Wang f0b6245ea4 [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles
## What changes were proposed in this pull request?

`dev/mima` and `dev/scalastyle` now support reading profiles dynamically from `modules.py`.

## How was this patch tested?

manual tests

Closes #24089 from wangyum/SPARK-27158.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-15 08:20:42 +09:00
Jiaxin Shan 2d0b7cfe44 [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/23814 was reverted because of a Jenkins integration test failure. After the minikube upgrade, Kubernetes client SDK v4.1.2 works with Kubernetes v1.13, so we can bring this change back.

Reference:
[Bump Kubernetes Client Version to 4.1.2](https://issues.apache.org/jira/browse/SPARK-26742)
[Original PR against master](https://github.com/apache/spark/pull/23814)
[Kubernetes client upgrade for Spark 2.4](https://github.com/apache/spark/pull/23993)

## How was this patch tested?

Unit Tests:
```
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  2.343 s]
[INFO] Spark Project Tags ................................. SUCCESS [  2.039 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 12.714 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  2.185 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 38.154 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  7.989 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  2.297 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.813 s]
[INFO] Spark Project Core ................................. SUCCESS [38:03 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [  3.848 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 56.084 s]
[INFO] Spark Project Streaming ............................ SUCCESS [04:58 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [06:39 min]
[INFO] Spark Project SQL .................................. SUCCESS [37:12 min]
[INFO] Spark Project ML Library ........................... SUCCESS [18:59 min]
[INFO] Spark Project Tools ................................ SUCCESS [  0.767 s]
[INFO] Spark Project Hive ................................. SUCCESS [33:45 min]
[INFO] Spark Project REPL ................................. SUCCESS [01:14 min]
[INFO] Spark Project Assembly ............................. SUCCESS [  1.444 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [01:12 min]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [  6.719 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [07:00 min]
[INFO] Spark Project Examples ............................. SUCCESS [ 21.805 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  0.906 s]
[INFO] Spark Avro ......................................... SUCCESS [ 50.486 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:32 h
[INFO] Finished at: 2019-03-07T08:39:34Z
[INFO] ------------------------------------------------------------------------

```

Closes #24002 from Jeffwan/update_k8s_sdk_master.

Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-13 15:04:27 -07:00
DB Tsai 2b9ad2516e [MINOR][BUILD] Add Scala 2.12 profile back for branch-2.4 build
Closes #24074 from dbtsai/scala-2.12.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-12 20:08:52 -07:00
Yuming Wang dccf6615c3 [SPARK-27130][BUILD] Automatically select profile when executing sbt-checkstyle
## What changes were proposed in this pull request?

This PR makes the profile selection automatic when executing `sbt-checkstyle`. The reason is that `hadoop-2.7` and `hadoop-3.1` may have different `hive-thriftserver` modules in the future.

## How was this patch tested?

manual tests:
```
Update AbstractService.java file.
export HADOOP_PROFILE=hadoop2.7
./dev/run-tests
```
The result:
![image](https://user-images.githubusercontent.com/5399861/54197992-5337e780-4500-11e9-930c-722982cdcd45.png)

Closes #24065 from wangyum/SPARK-27130.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-13 08:03:46 +09:00
Yuming Wang 8ab13065f6 [SPARK-23807][FOLLOW-UP][BUILD][TEST-HADOOP3.1] Add test-hadoop3.1 phrase
## What changes were proposed in this pull request?

Add the `test-hadoop3.1` phrase to test Spark against Spark's Hadoop 3.1 profile.

## How was this patch tested?
Tested on Jenkins. This is the output:
```
[info] Using build tool sbt with Hadoop profile hadoop3.1 under environment amplab_jenkins
...
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-3.1 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103282/console

Closes #24045 from wangyum/SPARK-23807.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 11:15:58 +09:00
Gabor Somogyi 3729efb4d0 [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs
## What changes were proposed in this pull request?

Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs were not yet available in PySpark.

In this PR I've made them available from PySpark.
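
A minimal usage sketch of the new Python APIs (assumes the external `spark-avro` package is on the classpath; column and schema names below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro, to_avro
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Avro writer schema as a JSON string; from_avro needs it to decode.
schema = """{"type": "record", "name": "rec", "fields": [
             {"name": "id", "type": "long"},
             {"name": "value", "type": "string"}]}"""

encoded = df.select(to_avro(struct("id", "value")).alias("avro"))   # struct -> Avro binary
decoded = encoded.select(from_avro("avro", schema).alias("rec"))    # Avro binary -> struct
decoded.select("rec.id", "rec.value").show()
```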

## How was this patch tested?

Please see the Python API examples I've added. I also verified the documentation build:

cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
Manual webpage check.

Closes #23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:15:07 +09:00
Yuming Wang eed3091a60 [SPARK-27120][BUILD][TEST] Upgrade scalatest version to 3.0.5
## What changes were proposed in this pull request?

**ScalaTest 3.0.5 Release Notes**

**Bug Fixes**

- Fixed the implicit view not available problem when used with compile macro.
- Fixed a stack depth problem in RefSpecLike and fixture.SpecLike under Scala 2.13.
- Changed Framework and ScalaTestFramework to set spanScaleFactor for Runner object instances for different Runners using different class loaders. This fixed a problem whereby an incorrect Runner.spanScaleFactor could be used when the tests for multiple sbt projects were run concurrently.
- Fixed a bug in endsWith regex matcher.

**Improvements**
- Removed duplicated parsing code for -C in ArgsParser.
- Improved performance in WebBrowser.
- Documentation typo rectification.
- Improve validity of Junit XML reports.
- Improved performance by replacing all .size == 0 and .length == 0 to .isEmpty.

**Enhancements**
- Added 'C' option to -P, which will tell -P to use cached thread pool.
- External Dependencies Update
- Bumped up scala-js version to 0.6.22.
- Changed to depend on mockito-core, not mockito-all.
- Bumped up jmock version to 2.8.3.
- Bumped up junit version to 4.12.
- Removed dependency to scala-parser-combinators.

More details:
http://www.scalatest.org/release_notes/3.0.5

## How was this patch tested?

manual tests on local machine:
```
nohup build/sbt clean -Djline.terminal=jline.UnsupportedTerminal -Phadoop-2.7  -Pkubernetes -Phive-thriftserver -Pyarn -Pspark-ganglia-lgpl -Phive -Pkinesis-asl -Pmesos test > run.scalatest.log &
```

Closes #24042 from wangyum/SPARK-27120.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-10 15:22:52 -07:00
Yuming Wang f732647ae4 [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
## What changes were proposed in this pull request?

Calcite is only used for [runSqlHive](02bbe977ab/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L699-L705)) when `hive.cbo.enable=true`([SemanticAnalyzer](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280)).
So we can disable `hive.cbo.enable` and remove Calcite dependency.

## How was this patch tested?

Existing tests

Closes #23970 from wangyum/SPARK-27054.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-09 16:34:24 -08:00
DB Tsai b6375097bc [SPARK-27026][BUILD] Upgrade Docker image for release build to Ubuntu 18.04 LTS
## What changes were proposed in this pull request?

Upgrade the Docker image for the release build to Ubuntu 18.04 LTS.

## How was this patch tested?

Manually tested.

Closes #23932 from dbtsai/ubuntu18.04.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-06 13:58:21 -08:00
Yanbo Liang 7857c6d633 [SPARK-27051][CORE] Bump Jackson version to 2.9.8
## What changes were proposed in this pull request?
FasterXML Jackson versions before 2.9.8 are affected by multiple [CVEs](https://github.com/FasterXML/jackson-databind/issues/2186), so we need to bump the Jackson dependency to 2.9.8.

## How was this patch tested?
Existing tests and offline benchmark.
I have run ```SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"``` to check there is no performance degradation for this upgrade.

Closes #23965 from yanboliang/SPARK-27051.

Authored-by: Yanbo Liang <ybliang8@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-05 11:46:51 +09:00
LantaoJin e5c502c596 [SPARK-25865][CORE] Add GC information to ExecutorMetrics
## What changes were proposed in this pull request?

Memory usage alone, without GC information, is not enough to determine the proper memory settings; we also need GC metrics such as the frequency of major and minor GC. For example, consider two executors that are both configured with 10GB and both use close to 10GB: should we increase or decrease their configured memory? These metrics can help: we can increase the configured memory for the first if it has very frequent major GCs, and decrease it for the second if it only has occasional minor GCs and no major GCs.
GC metrics are only meaningful over the entire lifetime of an executor, rather than for individual stages.

## How was this patch tested?

Adding UT.

Closes #22874 from LantaoJin/SPARK-25865.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-03-04 14:26:02 -06:00
Sean Owen d8754df2bf [SPARK-27029][BUILD] Update Thrift to 0.12.0
## What changes were proposed in this pull request?

Update Thrift to 0.12.0 to pick up bug and security fixes.
Changes: https://github.com/apache/thrift/blob/master/CHANGES.md
The important one is for https://issues.apache.org/jira/browse/THRIFT-4506

## How was this patch tested?

Existing tests. A quick local test suggests this works.

Closes #23935 from srowen/SPARK-27029.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 17:28:37 -08:00
Marcelo Vanzin d00eca75b3 [SPARK-26048][BUILD] Enable flume profile when creating 2.x releases.
Closes #23931 from vanzin/SPARK-26048.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 08:14:06 -08:00
Sean Owen 131b464d0c [SPARK-26986][ML][FOLLOWUP] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Remove a few new JAXB dependencies that shouldn't be necessary now.
See https://github.com/apache/spark/pull/23890#issuecomment-468299922

## How was this patch tested?

Existing tests

Closes #23923 from srowen/SPARK-26986.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:23:40 -06:00
Sean Owen 9c283662c6 [SPARK-26986][ML] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Add reference JAXB impl for Java 9+ from Glassfish. Right now it's only apparently necessary in MLlib but can be expanded later.

## How was this patch tested?

Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.

Closes #23890 from srowen/SPARK-26986.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 18:26:49 -06:00
Marcelo Vanzin afbff6446f Revert "[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2"
This reverts commit a3192d966a.
2019-02-26 13:42:07 -08:00
Jungtaek Lim (HeartSaVioR) c5de804093 [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org"
## What changes were proposed in this pull request?

The build below failed in the Java checkstyle test, but instead of a style violation it shows a FileNotFound error for the DTD file.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/

It looks like the DTD link `http://www.puppycrawl.com/dtds/configuration_1_3.dtd` is dead.

This patch updates the DTD links to "https://checkstyle.org/dtds/", since the checkstyle repository also updated the URL path:
https://github.com/checkstyle/checkstyle/issues/5601

## How was this patch tested?

Checked the new links.

Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-25 11:25:53 -08:00
Jiaxin Shan a3192d966a [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
## What changes were proposed in this pull request?
Changed the `kubernetes-client` version to 4.1.2. The latest version fixes an error with exec credentials (used by AWS EKS) when talking to the Kubernetes API server. With this patch, users can submit Spark jobs to an EKS API endpoint.

## How was this patch tested?
unit tests and manual tests.

Closes #23814 from Jeffwan/update_k8s_sdk.

Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 04:56:04 -06:00
Holden Karau 6b3c832dac [SPARK-26882] Check the Kubernetes integration tests scalastyle
## What changes were proposed in this pull request?

Add the kubernetes integration tests to the scalastyle profiles.

## How was this patch tested?

Run ./dev/scalastyle with a bad change manually

## Follow on work

See SPARK-26898 to add scalastyle for k8s integration to the CI

Closes #23792 from holdenk/SPARK-26882-check-k8s-integration-tests-when-linting.

Authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-19 13:49:47 -08:00
cchung100m dc46fb77ba [SPARK-26822] Upgrade the deprecated module 'optparse'
Follow the [official documentation](https://docs.python.org/2/library/argparse.html#upgrading-optparse-code) to upgrade from the deprecated 'optparse' module to 'argparse'.

## What changes were proposed in this pull request?

This PR proposes to replace the 'optparse' module with the 'argparse' module.
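
For illustration, a before/after sketch of the migration pattern the linked documentation describes (the option name here is hypothetical, not taken from the scripts being changed):

```python
# Deprecated optparse style:
#     from optparse import OptionParser
#     parser = OptionParser()
#     parser.add_option("--release-version", dest="release_version")
#     (opts, args) = parser.parse_args()

# argparse equivalent:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--release-version", dest="release_version")
args = parser.parse_args()
print(args.release_version)
```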

## How was this patch tested?

Following the [previous testing](7e3eb3cd20), this was tested manually, and negative tests were also done. My [test results](https://gist.github.com/cchung100m/1661e7df6e8b66940a6e52a20861f61d)

Closes #23730 from cchung100m/solve_deprecated_module_optparse.

Authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-10 00:36:22 -06:00
Ryan Blue f72d217788
[SPARK-26677][BUILD] Update Parquet to 1.10.1 with notEq pushdown fix.
## What changes were proposed in this pull request?

Update to Parquet Java 1.10.1.

## How was this patch tested?

Added a test from HyukjinKwon that validates the notEq case from SPARK-26677.

Closes #23704 from rdblue/SPARK-26677-fix-noteq-parquet-bug.

Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-02-02 09:17:52 -08:00
Hyukjin Kwon cdd694c52b [SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins
## What changes were proposed in this pull request?

### Background

As of the current status, the test script that generates coverage information has already been merged into Spark: https://github.com/apache/spark/pull/20204

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like `run-tests` script in `./python`.

### Proposed change

The next step is to host this coverage report via `github.io` automatically
by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage job in
[spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/)

In that specific job, it will clone the site and rebuild it with the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to master.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So a single, up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft HyukjinKwon
- [x] `pip install coverage` to all python implementations (pypy, python2, python3) in Jenkins workers  - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature
 This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp
- [x] Make PR builder's test passed HyukjinKwon
- [x] Fix flaky test related with coverage HyukjinKwon
  -  6 consecutive passes out of 7 runs

This PR is co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes #23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-01 10:18:08 +08:00
Bryan Cutler 16990f9299 [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
## What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175

complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name
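
For reference, a minimal PySpark snippet that exercises the Arrow-backed conversion path affected by this upgrade (a sketch only; assumes pyarrow >= 0.12.0 and pandas are installed, and uses the pre-3.0 name of the Arrow optimization flag):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()               # Spark -> pandas transfer goes through Arrow batches
df2 = spark.createDataFrame(pdf)  # pandas -> Spark also uses Arrow when enabled
df2.show(3)
```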

## How was this patch tested?

Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0

Closes #23657 from BryanCutler/arrow-upgrade-012.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-29 14:18:45 +08:00
Sean Owen c2d0d700b5 [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis
## What changes were proposed in this pull request?

Misc code cleanup from lgtm.com analysis. See comments below for details.

## How was this patch tested?

Existing tests.

Closes #23571 from srowen/SPARK-26640.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-17 19:40:39 -06:00
wright 4915cb3adf [MINOR][BUILD] ensure call to translate_component has correct number of arguments
## What changes were proposed in this pull request?

The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error.

I work for Semmle, and noticed the bug with our LGTM code analyzer:
0655f1624f/files/dev/create-release/releaseutils.py?sort=name&dir=ASC&mode=heatmap#x1434915b6576fb40:1

## How was this patch tested?

I checked that `./dev/run-tests` passes OK.

Closes #23567 from ipwright/wrong-number-of-arguments-fix.

Authored-by: wright <wright@semmle.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-16 21:00:58 -06:00
Takeshi Yamamuro abc937b247 [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only
## What changes were proposed in this pull request?
To skip the steps that remove binary license/notice files when making a source release for branch-2.3 (these files only exist in master/branch-2.4 now), this PR checks the Spark release version in `dev/create-release/release-build.sh`.

## How was this patch tested?
Manually checked.

Closes #23538 from maropu/FixReleaseScript.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-14 19:17:39 -06:00
Dongjoon Hyun 6f35ede31c
[SPARK-26554][BUILD][FOLLOWUP] Use GitHub instead of GitBox to check HEADER
## What changes were proposed in this pull request?

This PR uses the GitHub repository instead of GitBox because the GitHub repo returns the HTTP status header correctly.

## How was this patch tested?

Manual.

```
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```

Closes #23482 from dongjoon-hyun/SPARK-26554-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-07 17:54:05 -08:00
Dongjoon Hyun 468d25ec74
[MINOR][BUILD] Fix script name in release-tag.sh usage message
## What changes were proposed in this pull request?

This PR fixes the old script name in `release-tag.sh`.

    $ ./release-tag.sh --help | head -n1
    usage: tag-release.sh

## How was this patch tested?

Manual.

    $ ./release-tag.sh --help | head -n1
    usage: release-tag.sh

Closes #23477 from dongjoon-hyun/SPARK-RELEASE-TAG.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-06 22:45:18 -08:00
Dongjoon Hyun fe039faddf
[SPARK-26554][BUILD] Update release-util.sh to avoid GitBox fake 200 headers
## What changes were proposed in this pull request?

Unlike the previous Apache Git repository, the new GitBox repository returns a fake HTTP 200 header instead of a `404 Not Found` header. This breaks the release scripts. This PR fixes them to check the HTML body message instead of the fake HTTP headers. This is a release blocker.

```bash
$ curl -s --head --fail "https://gitbox.apache.org/repos/asf?p=spark.git;a=commit;h=v3.0.0"
HTTP/1.1 200 OK
Date: Sun, 06 Jan 2019 22:42:39 GMT
Server: Apache/2.4.18 (Ubuntu)
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Allow-Headers: X-PINGOTHER
Access-Control-Max-Age: 1728000
Content-Type: text/html; charset=utf-8
```

**BEFORE**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
v2.4.1-rc1 already exists. Continue anyway [y/n]?
```

**AFTER**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```

## How was this patch tested?

Manual.

Closes #23476 from dongjoon-hyun/SPARK-26554.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-06 19:59:31 -08:00
Dongjoon Hyun 5969b8a2ed
[SPARK-26541][BUILD] Add -Pdocker-integration-tests to dev/scalastyle
## What changes were proposed in this pull request?

This PR makes `scalastyle` also check the `docker-integration-tests` module and fixes one error.

## How was this patch tested?

Pass the Jenkins with the updated Scalastyle.
```
========================================================================
Running Scala style checks
========================================================================
Scalastyle checks passed.
```

Closes #23459 from dongjoon-hyun/SPARK-26541.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-05 00:55:17 -08:00
shane knapp bccb8602d7
[SPARK-26537][BUILD] change git-wip-us to gitbox
## What changes were proposed in this pull request?

due to apache recently moving from git-wip-us.apache.org to gitbox.apache.org, we need to update the packaging scripts to point to the new repo location.

this will also need to be backported to 2.4, 2.3, 2.1, 2.0 and 1.6.

## How was this patch tested?

the build system will test this.

Closes #23454 from shaneknapp/update-apache-repo.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-04 18:27:26 -08:00
Dongjoon Hyun 81addaa6b7
[SPARK-26427][BUILD] Upgrade Apache ORC to 1.5.4
## What changes were proposed in this pull request?

This PR aims to update the Apache ORC dependency to the latest version, 1.5.4, released on Dec. 20 ([Release Notes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12344187)).
```
[ORC-237] OrcFile.mergeFiles Specified block size is less than configured minimum value
[ORC-409] Changes for extending MemoryManagerImpl
[ORC-410] Fix a locale-dependent test in TestCsvReader
[ORC-416] Avoid opening data reader when there is no stripe
[ORC-417] Use dynamic Apache Maven mirror link
[ORC-419] Ensure to call `close` at RecordReaderImpl constructor exception
[ORC-432] openjdk 8 has a bug that prevents surefire from working
[ORC-435] Ability to read stripes that are greater than 2GB
[ORC-437] Make acid schema checks case insensitive
[ORC-411] Update build to work with Java 10.
[ORC-418] Fix broken docker build script
```

## How was this patch tested?

Build and pass Jenkins.

Closes #23364 from dongjoon-hyun/SPARK-26427.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-12-22 00:41:21 -08:00
Reza Safi 90c77ea313 [SPARK-24958][CORE] Add memory from procfs to executor metrics.
This adds the entire memory used by Spark's executor (as measured by procfs) to the executor metrics. The memory usage is collected from the entire process tree under the executor. The metrics are subdivided into memory used by Java, by Python, and by other processes, to aid users in diagnosing the source of high memory usage.
The additional metrics are sent to the driver in heartbeats, using the mechanism introduced by SPARK-23429.  This also slightly extends that approach to allow one ExecutorMetricType to collect multiple metrics.
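
A rough sketch of the procfs walk described above (illustrative Python only; the actual implementation is Scala inside Spark and additionally splits the totals by JVM, Python, and other processes):

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def rss_bytes(pid):
    # /proc/<pid>/statm: the second field is the resident set size in pages.
    with open("/proc/%d/statm" % pid) as f:
        return int(f.read().split()[1]) * PAGE_SIZE

def child_pids(pid):
    # Child PIDs are listed under /proc/<pid>/task/<tid>/children on recent kernels.
    children = []
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        with open(os.path.join(task_dir, tid, "children")) as f:
            children += [int(p) for p in f.read().split()]
    return children

def tree_rss_bytes(root_pid):
    # Sum RSS over the whole process tree rooted at the executor JVM.
    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        total += rss_bytes(pid)
        stack.extend(child_pids(pid))
    return total
```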

Added unit tests and also tested on a live cluster.

Closes #22612 from rezasafi/ptreememory2.

Authored-by: Reza Safi <rezasafi@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2018-12-10 11:14:11 -06:00
Sean Owen 2ea9792fde [SPARK-26266][BUILD] Update to Scala 2.12.8
## What changes were proposed in this pull request?

Update to Scala 2.12.8

## How was this patch tested?

Existing tests.

Closes #23218 from srowen/SPARK-26266.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-08 05:59:53 -06:00
Dongjoon Hyun 4772265203
[SPARK-26298][BUILD] Upgrade Janino to 3.0.11
## What changes were proposed in this pull request?

This PR aims to upgrade the Janino compiler to the latest version, 3.0.11. The following are the changes from the [release note](http://janino-compiler.github.io/janino/changelog.html).

- Script with many "helper" variables.
- Java 9+ compatibility
- Compilation Error Messages Generated by JDK.
- Added experimental support for the "StackMapFrame" attribute; not active yet.
- Make Unparser more flexible.
- Fixed NPEs in various "toString()" methods.
- Optimize static method invocation with rvalue target expression.
- Added all missing "ClassFile.getConstant*Info()" methods, removing the necessity for many type casts.

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #23250 from dongjoon-hyun/SPARK-26298.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-12-06 20:50:57 -08:00
cody koeninger 5e5b9f2ee0 [SPARK-26177] Config change followup to [] Automated formatting for Scala code
Let's keep this open for a while to see if other configuration tweaks are suggested

## What changes were proposed in this pull request?

Formatting configuration changes following up
https://github.com/apache/spark/pull/23148

## How was this patch tested?

./dev/scalafmt

Closes #23182 from koeninger/scalafmt-config.

Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-03 10:03:51 -06:00
Kazuaki Ishizaki 1abfbda7eb [SPARK-26212][BUILD][TEST-MAVEN] Upgrade maven version to 3.6.0
## What changes were proposed in this pull request?

This PR updates the Maven version from 3.5.4 to 3.6.0. The release notes for 3.6.0 are [here](https://maven.apache.org/docs/3.6.0/release-notes.html).

From [the release notes of 3.6.0](https://maven.apache.org/docs/3.6.0/release-notes.html), the following are notable changes:
1. Issues related to the project discovery time, which had increased in the previous version and affected some users, were addressed.
1. The output in the reactor summary has been improved.
1. An issue related to classpath ordering was fixed.

## How was this patch tested?

Existing tests

Closes #23177 from kiszk/SPARK-26212.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-01 07:06:18 -06:00
cody koeninger 9a09e91a3e [SPARK-26177] Automated formatting for Scala code
## What changes were proposed in this pull request?

Add a maven plugin and wrapper script to use scalafmt to format files that differ from git master.

The intention is for contributors to be able to use this to automate fixing code style, not to include it in the build pipeline yet.

If this PR is accepted, I'd make a different PR to update the code style section of https://spark.apache.org/contributing.html to mention the script

## How was this patch tested?

Manually tested by modifying a few files and running ./dev/scalafmt then checking that ./dev/scalastyle still passed.

Closes #23148 from koeninger/scalafmt.

Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-29 08:54:31 -06:00
hyukjinkwon 41d5aaec84 [SPARK-26148][PYTHON][TESTS] Increases default parallelism in PySpark tests to speed up
## What changes were proposed in this pull request?

This PR proposes to increase the parallelism in PySpark tests from 4 to 8 to speed them up.

It decreases the elapsed time from

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99163/consoleFull
Tests passed in 1770 seconds

to

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99186/testReport/
Tests passed in 1027 seconds

## How was this patch tested?

Jenkins tests

Closes #23111 from HyukjinKwon/parallelism.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-26 00:26:24 +09:00
Takanobu Asanuma 15c0384977
[SPARK-26134][CORE] Upgrading Hadoop to 2.7.4 to fix java.version problem
## What changes were proposed in this pull request?

When I ran spark-shell on JDK 11+28 (2018-09-25), it failed with the error below.

```
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
	at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
	at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
	at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
	at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
	at java.base/java.lang.String.substring(String.java:1874)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
```
This is a Hadoop issue that fails to parse some java.version values: Hadoop's `Shell` class takes a fixed-length substring of the version string, which throws for short versions such as `11` (see the `StringIndexOutOfBoundsException` above). It has been fixed since Hadoop 2.7.4 (see [HADOOP-14586](https://issues.apache.org/jira/browse/HADOOP-14586)).

Note that Hadoop 2.7.5 and later have another problem with Spark ([SPARK-25330](https://issues.apache.org/jira/browse/SPARK-25330)), so upgrading to 2.7.4 is fine for now.

## How was this patch tested?
Existing tests.

Closes #23101 from tasanuma/SPARK-26134.

Authored-by: Takanobu Asanuma <tasanuma@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-21 23:09:57 -08:00
shane knapp 42c48387c0 [BUILD] refactor dev/lint-python in to something readable
## What changes were proposed in this pull request?

`dev/lint-python` is a mess of nearly unreadable bash. I would like to fix that as best I can.

## How was this patch tested?

the build system will test this.

Closes #22994 from shaneknapp/lint-python-refactor.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2018-11-20 12:38:40 -08:00
Bryan Cutler 034ae305c3 [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py file into smaller files
## What changes were proposed in this pull request?

This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain.

The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_base.py
│   │   ├── test_evaluation.py
│   │   ├── test_feature.py
│   │   ├── test_image.py
│   │   ├── test_linalg.py
│   │   ├── test_param.py
│   │   ├── test_persistence.py
│   │   ├── test_pipeline.py
│   │   ├── test_stat.py
│   │   ├── test_training_summary.py
│   │   ├── test_tuning.py
│   │   └── test_wrapper.py
...
├── testing
...
│   ├── mlutils.py
...
```

## How was this patch tested?

Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all passing with Python 2.7 and Python 3.6.

Closes #23063 from BryanCutler/python-test-breakup-ml-SPARK-26033.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-18 16:02:15 +08:00
Marcelo Vanzin d2792046a1 [SPARK-26095][BUILD] Disable parallelization in make-distribution.sh.
It makes the build slower, but at least it doesn't hang. Seems that
maven-shade-plugin has some issue with parallelization.

Closes #23061 from vanzin/SPARK-26095.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2018-11-16 15:57:38 -08:00
Bryan Cutler a2fc48c28c [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py file into smaller files
## What changes were proposed in this pull request?

This PR breaks down the large mllib/tests.py file that contains all Python MLlib unit tests into several smaller test files to be easier to read and maintain.

The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── mllib
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_feature.py
│   │   ├── test_linalg.py
│   │   ├── test_stat.py
│   │   ├── test_streaming_algorithms.py
│   │   └── test_util.py
...
├── testing
...
│   ├── mllibutils.py
...
```

## How was this patch tested?

Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-mllib` to verify all passing with Python 2.7 and Python 3.6. Also installed scipy to include optional tests in test_linalg.

Closes #23056 from BryanCutler/python-test-breakup-mllib-SPARK-26034.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-17 00:12:17 +08:00