PR #23890 introduced `org.glassfish.jaxb:jaxb-runtime:2.3.2` as a runtime dependency. As an unexpected side effect, `jakarta.activation:jakarta.activation-api:1.2.1` was also pulled in as a transitive dependency. As a result, for the Maven build, both of the following two jars can be found under `assembly/target/scala-2.12/jars/`:
```
activation-1.1.1.jar
jakarta.activation-api-1.2.1.jar
```
This PR excludes the Jakarta one.
Manually built Spark using Maven and checked files under `assembly/target/scala-2.12/jars/`. After this change, only `activation-1.1.1.jar` is there.
Closes #24507 from liancheng/spark-27611.
Authored-by: Cheng Lian <lian@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Update the Maven version in `docs/building-spark.md`. Otherwise, users following the documented version hit this error:
```
mvn package -DskipTests=true
...
[INFO] --- maven-enforcer-plugin:3.0.0-M2:enforce (enforce-versions) spark-parent_2.12 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.6.0 is not in the allowed range 3.6.1.
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce (enforce-versions) on project spark-parent_2.12: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
...
```
## How was this patch tested?
Just verified that `https://archive.apache.org/dist/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.zip` is available.
Closes #24477 from wangyum/SPARK-27467.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-3.2 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly
```
`(w/Hive 1.2.1)` is incorrect when testing hadoop-3.2; it should be `(w/Hive 2.3.4)`.
This PR removes `(w/Hive 1.2.1)` from run-tests.py.
## How was this patch tested?
N/A
Closes #24451 from wangyum/run-tests-invalid-info.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Have Jenkins test against Python 3.6 (instead of 3.4).
## How was this patch tested?
Extensive testing on both the CentOS and Ubuntu Jenkins workers.
NOTE: this will need to be backported to all active branches.
Closes #24266 from shaneknapp/updating-python3-executable.
Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Update pyrolite to 4.23 to pick up bug and security fixes.
## How was this patch tested?
Existing tests.
Closes #24381 from srowen/SPARK-27470.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Unify commons-beanutils deps to latest 1.9.3. This resolves the version inconsistency in Hadoop 2.7's build and also picks up security and bug fixes.
## How was this patch tested?
Existing tests.
Closes #24378 from srowen/SPARK-27469.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR upgrades `lz4-java` to 1.5.1 in order to get a patch for avoiding racing with GC.
- https://github.com/lz4/lz4-java/blob/master/CHANGES.md#151
## How was this patch tested?
Pass the Jenkins with the existing tests.
Closes #24363 from dongjoon-hyun/SPARK-LZ4.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR mainly contains:
1. Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4.
2. Resolve compatibility issues between Hive 1.2.1 and Hive 2.3.4 in the `sql/hive` module.
## How was this patch tested?
jenkins test hadoop-2.7
manual test hadoop-3:
```shell
build/sbt clean package -Phadoop-3.2 -Phive
export SPARK_PREPEND_CLASSES=true
# rm -rf metastore_db
cat <<EOF > test_hadoop3.scala
spark.range(10).write.saveAsTable("test_hadoop3")
spark.table("test_hadoop3").show
EOF
bin/spark-shell --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true -i test_hadoop3.scala
```
Closes #23788 from wangyum/SPARK-23710-hadoop3.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12.
Add missing mustache license
## How was this patch tested?
I manually tested the UI locally with the JavaScript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but after reading its release notes, I don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage.
Closes #24288 from srowen/SPARK-27358.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap cannot be serialized/deserialized with the unsafe KryoSerializer.
This is a bug in RoaringBitmap 0.5.11 that is fixed in the latest version.
This is an update of #24157
## How was this patch tested?
Add a UT
Closes #24264 from LantaoJin/SPARK-27216.
Lead-authored-by: LantaoJin <jinlantao@gmail.com>
Co-authored-by: Lantao Jin <jinlantao@gmail.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
## What changes were proposed in this pull request?
Fix testing issues with `yarn` module in Hadoop-3:
1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```.
2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```.
3. Adapt `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```.
## How was this patch tested?
manual tests:
```shell
build/sbt yarn/test -Pyarn
build/sbt yarn/test -Phadoop-3.2 -Pyarn
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2
```
Closes #24115 from wangyum/hadoop3-yarn.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.
## How was this patch tested?
Doc build
Closes #24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
(See JIRA for problem statement)
Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix.
There appear to be no other changes of consequence:
https://github.com/xerial/snappy-java/blob/master/Milestone.md
## How was this patch tested?
Existing tests
Closes #24242 from srowen/SPARK-27267.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
I have hit this case a few times before:
```
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Restoring head pointer to master
git checkout master
Already on 'master'
git branch
Traceback (most recent call last):
File "./dev/merge_spark_pr_jira.py", line 537, in <module>
main()
File "./dev/merge_spark_pr_jira.py", line 523, in main
resolve_jira_issues(title, merged_refs, jira_comment)
File "./dev/merge_spark_pr_jira.py", line 359, in resolve_jira_issues
resolve_jira_issue(merge_branches, comment, jira_id)
File "./dev/merge_spark_pr_jira.py", line 302, in resolve_jira_issue
jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
File "./dev/merge_spark_pr_jira.py", line 302, in <lambda>
jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
File "./dev/merge_spark_pr_jira.py", line 300, in get_version_json
return filter(lambda v: v.name == version_str, versions)[0].raw
IndexError: list index out of range
```
I mistyped the fix version (there's a comma in `3.0,0`) and it ended the loop in the merge script. Not a big deal, but it has bugged me a few times. I hit it again today and decided to fix it.
This PR proposes to recover from wrongly set fix versions.
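The recovery behavior can be sketched roughly as follows. This is a minimal illustration, not the code in the merge script: `parse_fix_versions`, `prompt_fix_versions`, and the `read_line` callable (a stand-in for `input()`) are hypothetical names, and treating a blank answer as "use the defaults without validation" is an assumption.

```python
def parse_fix_versions(raw, default_versions):
    """Split a comma-separated answer; a blank answer selects the defaults."""
    if not raw.strip():
        return list(default_versions)
    return [v.strip() for v in raw.split(",") if v.strip()]


def prompt_fix_versions(read_line, available_versions, default_versions):
    """Keep asking until every requested version is a known version.

    read_line is a hypothetical callable standing in for input().
    """
    available = set(available_versions)
    while True:
        raw = read_line()
        requested = parse_fix_versions(raw, default_versions)
        if not raw.strip():
            # Assumption: a blank answer accepts the defaults as-is.
            return requested
        missing = [v for v in requested if v not in available]
        if not missing:
            return requested
        # Retry instead of failing later with an IndexError.
        print("Specified version(s) [%s] not found in the available versions, "
              "try again (or leave blank and fix manually)." % ", ".join(requested))
```

With answers `3.0,0` followed by `3.0.0,2.4.2`, the first is rejected with a message and the second is returned, which mirrors the "recovered" transcripts in the testing section.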
## How was this patch tested?
I manually copied and pasted the relevant code and tested it separately in both Python 2 and Python 3.
**Positive cases:**
```
Enter comma-separated fix version(s) [3.0.0]: # blank test (to use default)
['3.0.0']
```
```
Enter comma-separated fix version(s) [3.0.0,2.4.2]: # multiple default versions
['3.0.0', '2.4.2']
```
```
Enter comma-separated fix version(s) [3.0.0]: 2.4.1 # valid version
['2.4.1']
```
```
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.4.2 # multiple valid versions
['3.0.0', '2.4.2']
```
**Keyboard interrupt(Ctrl + c):**
```
Enter comma-separated fix version(s) [3.0.0]: ^CTraceback (most recent call last): # keyboard interrupt
File "test_merge_script.py", line 45, in <module>
test()
File "test_merge_script.py", line 26, in test
fix_versions = input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
KeyboardInterrupt
```
**Wrongly typed versions (recovered):**
```
Enter comma-separated fix version(s) [3.0.0]: 3.1
Specified version(s) [3.1] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 123
Specified version(s) [123] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Specified version(s) [3.0, 0] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: damn
Specified version(s) [damn] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.5.2 # one invalid versions in multiple versions
Specified version(s) [3.0.0, 2.5.2] not found in the available versions, try again (or leave blank and fix manually).
```
**Arbitrary exceptions in fix version parsing (recovered)**
```
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
File "tmp.py", line 11, in <module>
raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
File "tmp.py", line 10, in <module>
raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
```
Closes #24213 from HyukjinKwon/merge_script_fix_version.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below.
## How was this patch tested?
Existing tests.
Closes #23098 from srowen/SPARK-26132.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR upgrades `hadoop-3` to `3.2.0` to work around [HADOOP-16086](https://issues.apache.org/jira/browse/HADOOP-16086). Otherwise some test cases will throw IllegalArgumentException:
```java
02:44:34.707 ERROR org.apache.hadoop.hive.ql.exec.Task: Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:116)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:109)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:102)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:369)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:730)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:719)
at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:709)
at org.apache.spark.sql.hive.StatisticsSuite.createNonPartitionedTable(StatisticsSuite.scala:719)
at org.apache.spark.sql.hive.StatisticsSuite.$anonfun$testAlterTableProperties$2(StatisticsSuite.scala:822)
```
## How was this patch tested?
manual tests
Closes #24106 from wangyum/SPARK-27175.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR aims to update Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107).
```
[ORC-452] Support converting MAP column from JSON to ORC Improvement
[ORC-447] Change the docker scripts to keep a persistent m2 cache
[ORC-463] Add `version` command
[ORC-475] ORC reader should lazily get filesystem
[ORC-476] Make SearchAgument kryo buffer size configurable
```
## How was this patch tested?
Pass the Jenkins with the existing tests.
Closes #24096 from dongjoon-hyun/SPARK-27165.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
`dev/mima` and `dev/scalastyle` now read profiles dynamically from `modules.py`.
## How was this patch tested?
manual tests
Closes #24089 from wangyum/SPARK-27158.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR automatically selects the profile when executing `sbt-checkstyle`. The reason is that `hadoop-2.7` and `hadoop-3.1` may have different `hive-thriftserver` modules in the future.
## How was this patch tested?
manual tests:
```
Update AbstractService.java file.
export HADOOP_PROFILE=hadoop2.7
./dev/run-tests
```
The result:
![image](https://user-images.githubusercontent.com/5399861/54197992-5337e780-4500-11e9-930c-722982cdcd45.png)
Closes #24065 from wangyum/SPARK-27130.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add `test-hadoop3.1` phrase to test Spark against Spark’s Hadoop 3.1 profile.
## How was this patch tested?
Tested on jenkins. This is output:
```
[info] Using build tool sbt with Hadoop profile hadoop3.1 under environment amplab_jenkins
...
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-3.1 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103282/console
Closes #24045 from wangyum/SPARK-23807.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark.
In this PR I've made them available from pyspark.
## How was this patch tested?
Please see the Python API examples that I've added.
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
Manual webpage check.
Closes #23797 from gaborgsomogyi/SPARK-26856.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
**ScalaTest 3.0.5 Release Notes**
**Bug Fixes**
- Fixed the implicit view not available problem when used with compile macro.
- Fixed a stack depth problem in RefSpecLike and fixture.SpecLike under Scala 2.13.
- Changed Framework and ScalaTestFramework to set spanScaleFactor for Runner object instances for different Runners using different class loaders. This fixed a problem whereby an incorrect Runner.spanScaleFactor could be used when the tests for multiple sbt project's were run concurrently.
- Fixed a bug in endsWith regex matcher.
**Improvements**
- Removed duplicated parsing code for -C in ArgsParser.
- Improved performance in WebBrowser.
- Documentation typo rectification.
- Improve validity of Junit XML reports.
- Improved performance by replacing all .size == 0 and .length == 0 to .isEmpty.
**Enhancements**
- Added 'C' option to -P, which will tell -P to use cached thread pool.
- External Dependencies Update
- Bumped up scala-js version to 0.6.22.
- Changed to depend on mockito-core, not mockito-all.
- Bumped up jmock version to 2.8.3.
- Bumped up junit version to 4.12.
- Removed dependency to scala-parser-combinators.
More details:
http://www.scalatest.org/release_notes/3.0.5
## How was this patch tested?
manual tests on local machine:
```
nohup build/sbt clean -Djline.terminal=jline.UnsupportedTerminal -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pyarn -Pspark-ganglia-lgpl -Phive -Pkinesis-asl -Pmesos test > run.scalatest.log &
```
Closes #24042 from wangyum/SPARK-27120.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Upgrade Docker image for release build to Ubuntu 18.04LTS
## How was this patch tested?
Manually tested.
Closes #23932 from dbtsai/ubuntu18.04.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
FasterXML Jackson versions before 2.9.8 are affected by multiple [CVEs](https://github.com/FasterXML/jackson-databind/issues/2186), so we need to bump the Jackson dependency to 2.9.8.
## How was this patch tested?
Existing tests and offline benchmark.
I have run ```SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"``` to check there is no performance degradation for this upgrade.
Closes #23965 from yanboliang/SPARK-27051.
Authored-by: Yanbo Liang <ybliang8@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Memory usage alone, without GC information, does not help us determine the proper memory settings. We also need GC metrics on the frequency of major and minor GCs. For example, take two executors that are both configured with 10GB of memory and both use close to 10GB. Should we increase or decrease their configured memory? GC metrics can help answer this: we can increase the configured memory for the first one if it has very frequent major GCs, and decrease it for the second one if it has only some minor GCs and no major GC.
GC metrics are only useful over the entire lifetime of an executor, not per stage.
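The reasoning above could be expressed as a small heuristic. This is only an illustrative sketch: `suggest_memory_change` and its `frequent_major_threshold` parameter are hypothetical and not part of this patch, and the threshold value is an arbitrary assumption.

```python
def suggest_memory_change(major_gc_count, minor_gc_count,
                          frequent_major_threshold=10):
    """Suggest a direction for executor memory from lifetime GC counts.

    Hypothetical helper: the counts are assumed to cover the executor's
    whole lifetime, and the threshold is an arbitrary illustration.
    """
    if major_gc_count >= frequent_major_threshold:
        # Frequent major GCs suggest the heap is too small.
        return "increase"
    if major_gc_count == 0 and minor_gc_count > 0:
        # Only minor GCs suggest there is headroom to give back.
        return "decrease"
    return "keep"
```

Two executors with the same memory usage can therefore get different suggestions once their GC counts are known.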
## How was this patch tested?
Adding UT.
Closes #22874 from LantaoJin/SPARK-25865.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
## What changes were proposed in this pull request?
Update Thrift to 0.12.0 to pick up bug and security fixes.
Changes: https://github.com/apache/thrift/blob/master/CHANGES.md
The important one is for https://issues.apache.org/jira/browse/THRIFT-4506
## How was this patch tested?
Existing tests. A quick local test suggests this works.
Closes #23935 from srowen/SPARK-27029.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Remove a few new JAXB dependencies that shouldn't be necessary now.
See https://github.com/apache/spark/pull/23890#issuecomment-468299922
## How was this patch tested?
Existing tests
Closes #23923 from srowen/SPARK-26986.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add reference JAXB impl for Java 9+ from Glassfish. Right now it's only apparently necessary in MLlib but can be expanded later.
## How was this patch tested?
Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.
Closes #23890 from srowen/SPARK-26986.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Changed the `kubernetes-client` version to 4.1.2. The latest version fixes an error with exec credentials (used by AWS EKS), and this client is used to talk to the Kubernetes API server. With this patch, users can now submit Spark jobs to an EKS API endpoint.
## How was this patch tested?
unit tests and manual tests.
Closes #23814 from Jeffwan/update_k8s_sdk.
Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add the kubernetes integration tests to the scalastyle profiles.
## How was this patch tested?
Run ./dev/scalastyle with a bad change manually
## Follow on work
See SPARK-26898 to add scalastyle for k8s integration to the CI
Closes #23792 from holdenk/SPARK-26882-check-k8s-integration-tests-when-linting.
Authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Follow the [official document](https://docs.python.org/2/library/argparse.html#upgrading-optparse-code) to upgrade the deprecated module 'optparse' to 'argparse'.
## What changes were proposed in this pull request?
This PR proposes to replace 'optparse' module with 'argparse' module.
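As a rough illustration of what such a migration looks like, here is a minimal sketch. The option names (`--parallelism`, `--modules`, `--verbose`) and the helper names are hypothetical and are not taken from the script changed in this PR.

```python
import argparse


def build_parser():
    # Hypothetical options for illustration; not the script's real flags.
    parser = argparse.ArgumentParser(description="Run selected tests")
    # optparse equivalent: parser.add_option("-p", "--parallelism",
    #                                        type="int", default=4, ...)
    # argparse takes a callable (int) rather than a type name string.
    parser.add_argument("-p", "--parallelism", type=int, default=4,
                        help="number of suites to run in parallel")
    parser.add_argument("--modules", type=str, default="all",
                        help="comma-separated list of modules")
    parser.add_argument("--verbose", action="store_true",
                        help="print extra output")
    return parser


def parse(argv):
    # optparse returned an (options, args) pair; argparse returns a single
    # namespace and rejects unexpected positional arguments on its own.
    return build_parser().parse_args(argv)
```

For example, `parse(["-p", "8", "--verbose"])` yields a namespace with `parallelism == 8` and `verbose == True`.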
## How was this patch tested?
Followed the [previous testing](7e3eb3cd20); manually tested, and negative tests were also done. My [test results](https://gist.github.com/cchung100m/1661e7df6e8b66940a6e52a20861f61d)
Closes #23730 from cchung100m/solve_deprecated_module_optparse.
Authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Update to Parquet Java 1.10.1.
## How was this patch tested?
Added a test from HyukjinKwon that validates the notEq case from SPARK-26677.
Closes #23704 from rdblue/SPARK-26677-fix-noteq-parquet-bug.
Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
### Background
For the current status, the test script that generates coverage information was merged
into Spark, https://github.com/apache/spark/pull/20204
So, we can generate the coverage report and site by, for example:
```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```
like `run-tests` script in `./python`.
### Proposed change
The next step is to host this coverage report via `github.io` automatically
by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).
This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.
In short, this PR aims to run the coverage in
[spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/)
In the specific job, it will clone the page, and rebase the up-to-date PySpark test coverage from the latest commit. For instance as below:
```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git
# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*
# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/
# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch
# Add all the files.
git add -A
# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"
# Delete the old branch.
git branch -D gh-pages
# Rename the temporary branch to gh-pages.
git branch -m gh-pages
# Finally, force update to our repository.
git push -f origin gh-pages
```
So a single, up-to-date coverage report can be shown on the `github-io` page. The commands above were manually tested.
### TODOs
- [x] Write a draft HyukjinKwon
- [x] `pip install coverage` to all python implementations (pypy, python2, python3) in Jenkins workers - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature
This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp
- [x] Make PR builder's test pass HyukjinKwon
- [x] Fix flaky test related with coverage HyukjinKwon
- 6 consecutive passes out of 7 runs
This PR will be co-authored with me and shaneknapp
## How was this patch tested?
It will be tested via Jenkins.
Closes #23117 from HyukjinKwon/SPARK-7721.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0
Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:
* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175
complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)
PySpark requires the following fixes to work with PyArrow 0.12.0
* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name
## How was this patch tested?
Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0
Closes #23657 from BryanCutler/arrow-upgrade-012.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Misc code cleanup from lgtm.com analysis. See comments below for details.
## How was this patch tested?
Existing tests.
Closes #23571 from srowen/SPARK-26640.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error.
I work for Semmle, and noticed the bug with our LGTM code analyzer:
0655f1624f/files/dev/create-release/releaseutils.py?sort=name&dir=ASC&mode=heatmap#x1434915b6576fb40:1
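The kind of fix described above can be sketched as follows. The function body here is a hypothetical stand-in, not the code in `releaseutils.py`, and the sketch uses a `None` sentinel rather than a literal empty-list default, which is a common variant that avoids sharing one list across calls.

```python
def translate_component(component, commit_hash, warnings=None):
    """Hypothetical stand-in: normalize a component name, recording warnings."""
    if warnings is None:
        # Fresh list per call, so callers that omit the argument
        # do not share state through a mutable default.
        warnings = []
    known = {"sql": "SQL", "mllib": "MLlib"}  # illustrative mapping only
    name = component.strip().lower()
    if name not in known:
        warnings.append("Unknown component %s (see %s)" % (component, commit_hash))
        return component
    return known[name]
```

A caller that passes only two arguments, such as `translate_component("sql", "abc123")`, now works instead of raising a `TypeError`.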
## How was this patch tested?
I checked that `./dev/run-tests` passes OK.
Closes #23567 from ipwright/wrong-number-of-arguments-fix.
Authored-by: wright <wright@semmle.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
To skip the steps that remove binary license/notice files in a source release for branch-2.3 (these files only exist in master/branch-2.4 now), this PR checks the Spark release version in `dev/create-release/release-build.sh`.
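The version gate can be illustrated with a short sketch. The real check lives in a shell script; the Python below only shows the idea, the helper names are hypothetical, and the 2.4 cutoff is an assumption drawn from the note that the files exist only in master/branch-2.4.

```python
def parse_major_minor(version):
    """Return (major, minor) from a version such as '2.4.1-SNAPSHOT'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def should_remove_binary_license_files(version):
    # Assumption: the binary license/notice files exist from 2.4 onwards,
    # so the removal steps only make sense for those releases.
    return parse_major_minor(version) >= (2, 4)
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically.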
## How was this patch tested?
Manually checked.
Closes #23538 from maropu/FixReleaseScript.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR uses GitHub repository instead of GitBox because GitHub repo returns HTTP header status correctly.
## How was this patch tested?
Manual.
```
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```
Closes #23482 from dongjoon-hyun/SPARK-26554-2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>