ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	020e84e92f	[SPARK-34486][K8S] Upgrade kubernetes-client to 4.13.2 ### What changes were proposed in this pull request? This PR aims to upgrade `kubernetes-client` library from 4.12.0 to 4.13.2 for Apache Spark 3.2.0. ### Why are the changes needed? This will bring [K8s 1.19.1](https://github.com/fabric8io/kubernetes-client/pull/2541) models officially and the latest bug fixes. - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.0 - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.1 - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the K8s IT and UT. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. 
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 19 minutes, 25 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #31602 from dongjoon-hyun/SPARK-34486. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-21 18:35:38 +09:00
Dongjoon Hyun	331c6fd4ef	[SPARK-34467][BUILD] Upgrade Zstd-jni to 1.4.8-4 ### What changes were proposed in this pull request? This PR aims to upgrade Zstd-JNI library to 1.4.8-4 to bring JNI side optimization. `ZStandardBenchmark` shows that there is no regression in terms of performance and show some improvements. ### Why are the changes needed? https://github.com/luben/zstd-jni/commits/v1.4.8-4 - `be9be47fae` - `be51ebade1` - `44ff8b6f95` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #31585 from dongjoon-hyun/SPARK-ZSTD-1.4.8-4. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-18 13:35:49 -08:00
“attilapiros”	bdcad33d8b	[SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler ### What changes were proposed in this pull request? Improving the documentation and release process by pinning the Jekyll version with a Gemfile and Bundler. Some files and their responsibilities within this PR: - `docs/.bundle/config` specifies a directory "docs/.local_ruby_bundle" which is used as the destination for installing the Ruby packages, instead of the global one which requires root access - `docs/Gemfile` specifies the required Jekyll version and other top level gem versions - `docs/Gemfile.lock` is generated by "bundle install". This file contains the exact resolved versions of all the gems, including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still, this file must be under version control, because when the version of a gem does not match the one specified in `Gemfile` an error occurs (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19"), first we have to add "ruby" as a platform ([which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby"; then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19". After this, the correct process to update the Jekyll version is the following: 1. update the version in `Gemfile` 2. run "bundle update" which updates the `Gemfile.lock` 3. 
commit both files This process for version update is tested for details please check the testing section. ### Why are the changes needed? Using different Jekyll versions can generate different output documents. This PR standardize the process. ### Does this PR introduce _any_ user-facing change? No, assuming the release was done via docker by using `do-release-docker.sh`. In that case there should be no difference at all as the same Jekyll version is specified in the Gemfile. ### How was this patch tested? #### Testing document generation Doc generation step was triggered via the docker release: ``` $ ./do-release-docker.sh -d ~/working -n -s docs ... ======================== = Building documentation... Command: /opt/spark-rm/release-build.sh docs Log file: docs.log Skipping publish step. ``` The docs.log contains the followings: ``` Building Spark docs Fetching gem metadata from https://rubygems.org/......... Using bundler 2.2.9 Fetching rb-fsevent 0.10.4 Fetching forwardable-extended 2.6.0 Fetching public_suffix 4.0.6 Fetching colorator 1.1.0 Fetching eventmachine 1.2.7 Fetching http_parser.rb 0.6.0 Fetching ffi 1.14.2 Fetching concurrent-ruby 1.1.8 Installing colorator 1.1.0 Installing forwardable-extended 2.6.0 Installing rb-fsevent 0.10.4 Installing public_suffix 4.0.6 Installing http_parser.rb 0.6.0 with native extensions Installing eventmachine 1.2.7 with native extensions Installing concurrent-ruby 1.1.8 Fetching rexml 3.2.4 Fetching liquid 4.0.3 Installing ffi 1.14.2 with native extensions Installing rexml 3.2.4 Installing liquid 4.0.3 Fetching mercenary 0.4.0 Installing mercenary 0.4.0 Fetching rouge 3.26.0 Installing rouge 3.26.0 Fetching safe_yaml 1.0.5 Installing safe_yaml 1.0.5 Fetching unicode-display_width 1.7.0 Installing unicode-display_width 1.7.0 Fetching webrick 1.7.0 Installing webrick 1.7.0 Fetching pathutil 0.16.2 Fetching kramdown 2.3.0 Fetching terminal-table 2.0.0 Fetching addressable 2.7.0 Fetching i18n 1.8.9 Installing terminal-table 
2.0.0 Installing pathutil 0.16.2 Installing i18n 1.8.9 Installing addressable 2.7.0 Installing kramdown 2.3.0 Fetching kramdown-parser-gfm 1.1.0 Installing kramdown-parser-gfm 1.1.0 Fetching rb-inotify 0.10.1 Fetching sassc 2.4.0 Fetching em-websocket 0.5.2 Installing rb-inotify 0.10.1 Installing em-websocket 0.5.2 Installing sassc 2.4.0 with native extensions Fetching listen 3.4.1 Installing listen 3.4.1 Fetching jekyll-watch 2.2.1 Installing jekyll-watch 2.2.1 Fetching jekyll-sass-converter 2.1.0 Installing jekyll-sass-converter 2.1.0 Fetching jekyll 4.2.0 Installing jekyll 4.2.0 Fetching jekyll-redirect-from 0.16.0 Installing jekyll-redirect-from 0.16.0 Bundle complete! 4 Gemfile dependencies, 30 gems now installed. Bundled gems are installed into `./.local_ruby_bundle` ``` #### Testing Jekyll (or other gem) update First locally I reverted Jekyll to 4.1.0: ``` $ rm Gemfile.lock $ rm -rf .local_ruby_bundle # edited Gemfile to use version 4.1.0 $ cat Gemfile source "https://rubygems.org" gem "jekyll", "4.1.0" gem "rouge", "3.26.0" gem "jekyll-redirect-from", "0.16.0" gem "webrick", "1.7" $ bundle install ... ``` Testing Jekyll version before the update: ``` $ bundle exec jekyll --version jekyll 4.1.0 ``` Imitating Jekyll update coming from git by reverting my local changes: ``` $ git checkout Gemfile Updated 1 path from the index $ cat Gemfile source "https://rubygems.org" gem "jekyll", "4.2.0" gem "rouge", "3.26.0" gem "jekyll-redirect-from", "0.16.0" gem "webrick", "1.7" $ git checkout Gemfile.lock Updated 1 path from the index ``` Run the install: ``` $ bundle install ... ``` Checking the updated Jekyll version: ``` $ bundle exec jekyll --version jekyll 4.2.0 ``` Closes #31559 from attilapiros/pin-jekyll-version. 
Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-18 12:17:57 +09:00
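The `Gemfile` shown in the commit above pins each top-level gem to an exact version. As a rough illustration only, the following Python sketch reads those pins back out of the Gemfile text quoted in the commit message. The helper name `pinned_gems` and the regular expression are assumptions made for this example; they are not part of Bundler or of the Spark build, and the sketch handles only the simple `gem "name", "version"` form.

```python
import re

# Hypothetical helper (not part of Bundler or Spark): matches only simple
# lines of the form: gem "name", "version"
GEM_LINE = re.compile(r'^\s*gem\s+"([^"]+)"\s*,\s*"([^"]+)"\s*$')


def pinned_gems(gemfile_text):
    """Return a dict mapping gem name to its pinned version string."""
    pins = {}
    for line in gemfile_text.splitlines():
        match = GEM_LINE.match(line)
        if match:
            name, version = match.groups()
            pins[name] = version
    return pins


# Gemfile contents as quoted in the commit message above.
GEMFILE = '''source "https://rubygems.org"
gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
'''

if __name__ == "__main__":
    for gem, version in pinned_gems(GEMFILE).items():
        print(gem, version)
```

A check like this could flag a `Gemfile` edit before running "bundle update", but the authoritative resolution still comes from Bundler and `Gemfile.lock`.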
HyukjinKwon	556ecd681a	[MINOR] Add a note about pip installation test in RC for release vote template ### What changes were proposed in this pull request? This PR proposes to add a note about pip installation test in RC for release vote template. ### Why are the changes needed? To promote PySpark users to test PyPi distribution and pip installation. ### Does this PR introduce _any_ user-facing change? No. It will be used for release vote. ### How was this patch tested? N/A Closes #31527 from HyukjinKwon/minor-update-vote-templ. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-08 22:24:42 +09:00
Dongjoon Hyun	329f945534	[SPARK-34391][BUILD] Upgrade commons-io to 2.8.0 ### What changes were proposed in this pull request? This PR aims to upgrade `commons-io` from 2.5 to 2.8.0 for Apache Spark 3.2.0. ### Why are the changes needed? `2.5` was released on 2016-04-22. This will bring the latest bug fixes. - [2020-09-05: 2.8.0](https://commons.apache.org/proper/commons-io/changes-report.html#a2.8.0) - [2020-05-24: 2.7](https://commons.apache.org/proper/commons-io/changes-report.html#a2.7) - [2017-10-15: 2.6](https://commons.apache.org/proper/commons-io/changes-report.html#a2.6) ### Does this PR introduce _any_ user-facing change? Yes, but this is a compatible dependency change. ### How was this patch tested? Pass the CIs. Closes #31503 from dongjoon-hyun/SPARK-34391. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-07 21:53:42 -08:00
William Hyun	5acc5b8f1e	[SPARK-34323][BUILD] Upgrade zstd-jni to 1.4.8-3 ### What changes were proposed in this pull request? This PR aims to upgrade zstd-jni to 1.4.8-3. ### Why are the changes needed? This will bring the latest improvements and bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing tests. Closes #31430 from williamhyun/zstd-148. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-02 00:39:05 -08:00
David Toneian	d99d0d27be	[SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of `dev/lint-python` This changeset is published into the public domain. ### What changes were proposed in this pull request? Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed. ### Why are the changes needed? In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed. ### Does this PR introduce _any_ user-facing change? Slight improvements in documentation, and in standard output of `dev/lint-python`. ### How was this patch tested? Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change. Closes #31401 from DavidToneian/SPARK-34300. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 09:30:50 +09:00
Yuming Wang	a7683afdf4	[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 ### What changes were proposed in this pull request? This PR upgrades Parquet to 1.11.1. Parquet 1.11.1 new features: - [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes - [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation - [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr More details: https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md ### Why are the changes needed? To support column indexes, which improve query performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26804 from wangyum/SPARK-26346. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-01-29 08:07:49 +08:00
HyukjinKwon	1217c8b418	Revert "[SPARK-31168][SPARK-33913][BUILD] Upgrade Scala to 2.12.13 and Kafka to 2.7.0" This reverts commit `a65e86a65e`.	2021-01-27 17:03:15 +09:00
Dongjoon Hyun	785d5822e5	[SPARK-34218][INFRA][FOLLOWUP] Fix Scala 2.13 profile typo in publish-snapshot ### What changes were proposed in this pull request? This is a follow-up of #31311 and fixes a typo in Scala 2.13 profile section in `publish-snapshot` command. ### Why are the changes needed? To fix snapshot publishing for Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. Closes #31338 from dongjoon-hyun/SPARK-34218-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-26 15:11:07 +09:00
Dongjoon Hyun	bdf71be0ff	[SPARK-34218][INFRA] Add Scala 2.13 packaging and publishing ### What changes were proposed in this pull request? This PR aims to add `Scala 2.13` packaging and publishing. ### Why are the changes needed? To support Scala 2.13 officially in Apache Spark 3.2.0, we need to publish the artifacts. ### Does this PR introduce _any_ user-facing change? Yes, this will provide additional artifacts. ### How was this patch tested? Manual. Closes #31311 from dongjoon-hyun/SPARK-34218. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-25 09:01:26 -08:00
Dongjoon Hyun	b8fc6f88b5	[SPARK-34217][INFRA] Fix Scala 2.12 release profile ### What changes were proposed in this pull request? This PR aims to fix the Scala 2.12 release profile in `release-build.sh`. ### Why are the changes needed? Since 3.0.0 (SPARK-26132), the release script is using `SCALA_2_11_PROFILES` to publish Scala 2.12 artifacts. After looking at the code, this is not a blocker because `-Pscala-2.11` is no-op in `branch-3.x`. In addition `scala-2.12` profile is enabled by default and it's an empty profile without any configuration technically. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is used by release manager only. Manually. This should land at `master/3.1/3.0`. Closes #31310 from dongjoon-hyun/SPARK-34217. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-24 14:59:55 -08:00
Dongjoon Hyun	d5d1c84bf4	[SPARK-34208][BUILD] Upgrade ORC to 1.6.7 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC from 1.6.6 to 1.6.7. ### Why are the changes needed? Apache ORC 1.6.7 has the following fixes including [ORC-711 Support CryptoExtension in create/decryptLocalKey](https://issues.apache.org/jira/browse/ORC-711). - https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing tests. Closes #31301 from dongjoon-hyun/SPARK-34208. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-22 17:06:18 -08:00
Ismaël Mejía	e9e81f798f	[SPARK-27733][CORE] Upgrade Avro to version 1.10.1 ### What changes were proposed in this pull request? Update the Avro dependency to version 1.10.1. ### Why are the changes needed? To pick up multiple improvements in Avro and to fix security issues in transitive dependencies. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Since no API changes were required, we just ran the tests. Closes #31232 from iemejia/SPARK-27733-avro-upgrade. Authored-by: Ismaël Mejía <iemejia@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-20 15:42:27 -08:00
CodingCat	7f3e952c23	[SPARK-33940][BUILD] Upgrade univocity to 2.9.1 ### What changes were proposed in this pull request? Upgrade univocity to 2.9.1. ### Why are the changes needed? The CSV writer has an implicit limit on column name length due to univocity-parser 2.9.0. When we initialize a writer (`e09114c687/src/main/java/com/univocity/parsers/common/AbstractWriter.java (L211)`), it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java (`e09114c687/src/main/java/com/univocity/parsers/common/NormalizedString.java (L205-L209)`). That stringCache.get has a maxStringLength cap (`e09114c687/src/main/java/com/univocity/parsers/common/StringCache.java (L104)`), which is 1024 by default. More details at https://github.com/apache/spark/pull/30972 and https://github.com/uniVocity/univocity-parsers/issues/438 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. Closes #31246 from CodingCat/upgrade_univocity. Authored-by: CodingCat <zhunansjtu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-20 11:40:37 +09:00
Dongjoon Hyun	a65e86a65e	[SPARK-31168][SPARK-33913][BUILD] Upgrade Scala to 2.12.13 and Kafka to 2.7.0 ### What changes were proposed in this pull request? This PR is the 3rd try to upgrade Scala 2.12.x in order to see the feasibility. - https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum ) - https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya ) `silencer` library is updated accordingly. And, Kafka version upgrade is required because it fails like the following. ``` [info] KafkaDataConsumerSuite: [info] org.apache.spark.streaming.kafka010.KafkaDataConsumerSuite * ABORTED * (1 second, 580 milliseconds) [info] java.lang.NoClassDefFoundError: scala/math/Ordering$$anon$7 [info] at kafka.api.ApiVersion$.orderingByVersion(ApiVersion.scala:45) ``` ### Why are the changes needed? Apache Spark was stuck to 2.12.10 due to the regression in Scala 2.12.11 and 2.12.12. This will bring all the bug fixes. - https://github.com/scala/scala/releases/tag/v2.12.13 - https://github.com/scala/scala/releases/tag/v2.12.12 - https://github.com/scala/scala/releases/tag/v2.12.11 ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug-fixed version. ### How was this patch tested? Pass the CIs. Closes #31223 from dongjoon-hyun/SPARK-31168. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-18 13:45:06 -08:00
Yuming Wang	c87b0085c9	[SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8 ### What changes were proposed in this pull request? Hive 2.3.8 changes: HIVE-19662: Upgrade Avro to 1.8.2 HIVE-24324: Remove deprecated API usage from Avro HIVE-23980: Shade Guava from hive-exec in Hive 2.3 HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue HIVE-24512: Exclude calcite in packaging. HIVE-22708: Fix for HttpTransport to replace String.equals HIVE-24551: Hive should include transitive dependencies from calcite after shading it HIVE-24553: Exclude calcite from test-jar dependency of hive-exec ### Why are the changes needed? To allow upgrading Avro and Parquet to the latest versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests, plus a test PR that tries to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517 Closes #30657 from wangyum/SPARK-33696. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-17 21:54:35 -08:00
Yuming Wang	d6906b3b76	[SPARK-34110][BUILD] Upgrade Zookeeper to 3.6.2 ### What changes were proposed in this pull request? This PR upgrades Zookeeper to 3.6.2. ### Why are the changes needed? To make Spark run on JDK 14. Otherwise, it fails as follows: ``` 21/01/13 20:25:32,533 WARN [Driver-SendThread(apache-spark-zk-3.vip.hadoop.com:2181)] zookeeper.ClientCnxn:1164 : Session 0x0 for server apache-spark-zk-3.vip.hadoop.com/<unresolved>:2181, unexpected error, closing socket connection and attempting reconnect java.lang.IllegalArgumentException: Unable to canonicalize address apache-spark-zk-3.vip.hadoop.com/<unresolved>:2181 because it's not resolvable at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) ``` Please see [ZOOKEEPER-3779](https://issues.apache.org/jira/browse/ZOOKEEPER-3779) for more details. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test: 1. Replace zookeeper-3.4.14.jar with zookeeper-3.6.2.jar and zookeeper-jute-3.6.2.jar 2. Run Spark on JDK 14. Hadoop 2.7 with HADOOP-12760, Hive 1.2.1 and Zookeeper server version is 3.4.6. Some key configurations: ``` # spark-defaults.conf spark.yarn.appMasterEnv.JAVA_HOME /apache/releases/jdk-14.0.2 spark.executorEnv.JAVA_HOME /apache/releases/jdk-14.0.2 # spark-env.sh export JAVA_HOME=/apache/releases/jdk-14.0.2 ``` Jenkins Tests - Hadoop 3.2: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134048/testReport - Hadoop 2.7: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134063/testReport Closes #31177 from wangyum/SPARK-34110. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-15 21:12:41 -08:00
Chao Sun	b6f46ca297	[SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile ### What changes were proposed in this pull request? This: 1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. 2. upgrades the built-in version for Hadoop 3.x to Hadoop 3.2.2 Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client. In order to keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties: ``` hadoop-client-api.artifact hadoop-client-runtime.artifact hadoop-client-minicluster.artifact ``` which default to: ``` hadoop-client-api hadoop-client-runtime hadoop-client-minicluster ``` but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`. Besides the above, there are the following changes: - explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars. - removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster`, which is a server-side/private API. - modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when the Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests). ### Why are the changes needed? Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, the latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach of switching to the shaded client jars provided by Hadoop. This also has the benefit of not pulling in other third-party dependencies from the Hadoop side, which avoids more potential future conflicts. 
### Does this PR introduce _any_ user-facing change? When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts. ### How was this patch tested? Relying on existing tests. Closes #30701 from sunchao/test-hadoop-3.2.2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-15 14:06:50 -08:00
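The class path ordering requirement described in the commit above (the shaded `hadoop-client-api` and `hadoop-client-runtime` jars should come before other Hadoop jars when using `hadoop-provided`) can be sketched as a small check. This is a minimal illustration under stated assumptions: the function name `shaded_clients_first`, the jar-name prefix matching, and the sample paths are all hypothetical and are not part of Spark or Hadoop.

```python
# Jar-name prefixes taken from the commit message above.
SHADED_PREFIXES = ("hadoop-client-api", "hadoop-client-runtime")


def shaded_clients_first(classpath_entries):
    """Return True if both shaded client jars are present and every one of
    them precedes all other hadoop-* jars in the given class path order.

    Hypothetical helper for illustration; it matches on file-name prefixes only.
    """
    names = [entry.rsplit("/", 1)[-1] for entry in classpath_entries]
    shaded_idx = [i for i, n in enumerate(names) if n.startswith(SHADED_PREFIXES)]
    other_idx = [
        i for i, n in enumerate(names)
        if n.startswith("hadoop-") and not n.startswith(SHADED_PREFIXES)
    ]
    # Both shaded artifacts must be on the class path.
    for prefix in SHADED_PREFIXES:
        if not any(n.startswith(prefix) for n in names):
            return False
    # With no other Hadoop jars there is nothing to conflict with.
    if not other_idx:
        return True
    return max(shaded_idx) < min(other_idx)


if __name__ == "__main__":
    # Illustrative paths only.
    good = [
        "/opt/jars/hadoop-client-api-3.2.2.jar",
        "/opt/jars/hadoop-client-runtime-3.2.2.jar",
        "/opt/jars/hadoop-common-3.2.2.jar",
    ]
    print(shaded_clients_first(good))
```

Prefix matching on file names is a deliberate simplification; a real deployment might need to inspect how its launcher actually assembles the class path.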
Kousuke Saruta	b1c4fc7fc7	[SPARK-34008][BUILD] Upgrade derby to 10.14.2.0 ### What changes were proposed in this pull request? This PR upgrades `derby` to `10.14.2.0`. You can check the major changes from the following URLs. * 10.13.1.1 http://svn.apache.org/repos/asf/db/derby/code/tags/10.13.1.1/RELEASE-NOTES.html * 10.14.1.0 http://svn.apache.org/repos/asf/db/derby/code/tags/10.14.1.0/RELEASE-NOTES.html * 10.14.2.0 http://svn.apache.org/repos/asf/db/derby/code/tags/10.14.2.0/RELEASE-NOTES.html ### Why are the changes needed? It seems to be the final release which supports `JDK8` as the minimum required version. After `10.15.1.3`, the minimum required version is `JDK9`. https://db.apache.org/derby/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #31032 from sarutak/upgrade-derby. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-05 21:50:16 -08:00
HyukjinKwon	6b86aa0b52	[SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1 ### What changes were proposed in this pull request? This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements. It contains one bug fix (`4152353ac1`). ### Why are the changes needed? To leverage fixes from the upstream in Py4J. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins build and GitHub Actions will test it out. Closes #31009 from HyukjinKwon/SPARK-33984. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-04 10:23:38 -08:00
William Hyun	bd346f4a2d	[SPARK-33957][BUILD] Update commons-lang3 to 3.11 ### What changes were proposed in this pull request? This PR aims to update commons-lang3 to 3.11 to support Java 16+ better. ### Why are the changes needed? commons-lang3 has the following bug fixes and Java 16 support. - https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Pass the CIs. Closes #30990 from williamhyun/Commons-lang3. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-01 19:59:17 -08:00
Hyukjin Kwon	403bf55cbe	[SPARK-33927][BUILD] Fix Dockerfile for Spark release to work ### What changes were proposed in this pull request? This PR proposes to fix the `Dockerfile` for Spark release. - Port `b135db3b1a` to `Dockerfile` - Upgrade Ubuntu 18.04 -> 20.04 (because of porting `b135db3`) - Remove Python 2 (because of Ubuntu upgrade) - Use built-in Python 3.8.5 (because of Ubuntu upgrade) - Node.js 11 -> 12 (because of Ubuntu upgrade) - Ruby 2.5 -> 2.7 (because of Ubuntu upgrade) - Python dependencies and Jekyll + plugins upgrade to the latest as it's used in GitHub Actions build (unrelated to the issue itself) ### Why are the changes needed? To make a Spark release :-). ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via: ```bash cd dev/create-release/spark-rm docker build -t spark-rm --build-arg UID=$UID . ``` ``` ... Successfully built 516d7943634f Successfully tagged spark-rm:latest ``` Closes #30971 from HyukjinKwon/SPARK-33927. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-30 16:37:23 +09:00
Dongjoon Hyun	00642ee19e	[SPARK-33843][BUILD] Upgrade to Zstd 1.4.8 ### What changes were proposed in this pull request? This PR aims to upgrade Zstd library to 1.4.8. ### Why are the changes needed? This will bring Zstd 1.4.7 and 1.4.8 improvement and bug fixes and the following from `zstd-jni`. - https://github.com/facebook/zstd/releases/tag/v1.4.7 - https://github.com/facebook/zstd/releases/tag/v1.4.8 - https://github.com/luben/zstd-jni/issues/153 (Apple M1 architecture) ### Does this PR introduce _any_ user-facing change? This will unblock Apple Silicon usage. ### How was this patch tested? Pass the CIs. Closes #30848 from dongjoon-hyun/SPARK-33843. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 06:59:44 -08:00
HyukjinKwon	ddda32b156	[SPARK-33802][INFRA][FOLLOW-UP] Separate arguments properly for -c option in git command for PySpark coverage ### What changes were proposed in this pull request? This PR proposes to separate arguments properly for `-c` options. Otherwise, the space is considered as its part of argument: ``` Cloning into 'pyspark-coverage-site'... unknown option: -c user.name='Apache Spark Test Account' usage: git [--version] [--help] [-C <path>] [-c <name>=<value>] [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path] [-p \| --paginate \| -P \| --no-pager] [--no-replace-objects] [--bare] [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>] <command> [<args>] [error] running git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' commit -am Coverage report at latest commit in Apache Spark ; received return code 129 ``` ### Why are the changes needed? To make the build pass (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1728/console). ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? ```python >>> from sparktestsupport.shellutils import run_cmd >>> run_cmd([ ... "git", ... "-c", ... "user.name='Apache Spark Test Account'", ... "-c", ... "user.email='sparktestaccgmail.com'", ... "commit", ... "-am", ... "Coverage report at latest commit in Apache Spark"]) [SPARK-33802-followup 80d2565a511] Coverage report at latest commit in Apache Spark 1 file changed, 1 insertion(+), 1 deletion(-) CompletedProcess(args=['git', '-c', "user.name='Apache Spark Test Account'", '-c', "user.email='sparktestaccgmail.com'", 'commit', '-am', 'Coverage report at latest commit in Apache Spark'], returncode=0) ``` I cannot run e2e test because it requires the env to have Jenkins secret. Closes #30804 from HyukjinKwon/SPARK-33802-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-16 23:42:34 +09:00
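The fix in the commit above comes down to passing `-c` and its `key=value` pair as two separate argv elements. A minimal sketch of that idea follows; the helper names `git_commit_argv` and `fused_options` and the sample identity are hypothetical, and unlike the commit's own snippet this sketch does not wrap the values in literal single quotes, since an argv list needs no shell quoting.

```python
def git_commit_argv(name, email, message):
    """Build an argv list for `git commit` with a per-invocation identity.

    Each `-c` flag and its `key=value` pair are separate list elements, so
    git receives them as two arguments rather than one fused string.
    """
    return [
        "git",
        "-c", f"user.name={name}",
        "-c", f"user.email={email}",
        "commit", "-am", message,
    ]


def fused_options(argv):
    """Return elements that fuse the flag and its value, e.g. '-c user.name=x'.

    Such an element reaches git as a single argument, which is what produced
    the "unknown option" error quoted in the commit message.
    """
    return [arg for arg in argv if arg.startswith("-c ")]


if __name__ == "__main__":
    # Placeholder identity for illustration only.
    argv = git_commit_argv("Test Account", "test@example.com", "Coverage report")
    print(argv)
    print(fused_options(argv))
```

Building the list explicitly, rather than splitting a command string, keeps values that contain spaces (such as the account name) intact as one argument.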
HyukjinKwon	888a274a88	[SPARK-33802][INFRA] Override name and email address explicitly when updating PySpark coverage ### What changes were proposed in this pull request? The current Jenkins job fails as below (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1726/console) ``` Generating HTML files for PySpark coverage under /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2/python/test_coverage/htmlcov /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2 Cloning into 'pyspark-coverage-site'... *** Please tell me who you are. Run git config --global user.email "you@example.com" git config --global user.name "Your Name" to set your account's default identity. Omit --global to set the identity only in this repository. ``` This PR proposes to set both when committing to the coverage site. ### Why are the changes needed? To make the coverage site keep working. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested in the console but it has to be merged to test in the Jenkins environment. Closes #30796 from HyukjinKwon/SPARK-33802. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-16 17:20:03 +09:00
Kent Yao	4d47ac4b4b	[SPARK-33705][SQL][TEST] Fix HiveThriftHttpServerSuite flakiness ### What changes were proposed in this pull request? This PR fixes the following flaky tests: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132345/testReport/ ``` org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.Checks Hive version org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.SPARK-24829 Checks cast as float ``` The root cause here is a jar conflict: `NewCookie.isHttpOnly` is not defined in the conflicting `jsr311-api.jar`. The transitive artifact `jsr311-api.jar` of `hadoop-client` is excluded on the Maven side. See https://issues.apache.org/jira/browse/SPARK-27179. The Jenkins PR builder and GitHub Actions use `SBT` as the build tool. First, the exclusion rule from Maven is not followed by SBT, so I was able to see `jsr311-api.jar` from the Maven cache being added to the classpath directly. This seems to be a bug in the `sbt-pom-reader` plugin, but I'm not sure. Then I added an `ExcludeRule` for the `hive-thriftserver` module on the SBT side and did see the `jsr311-api.jar` gone, but the CI jobs still failed with the same error. I added a trace log in ThriftHttpServlet ```s ERROR ThriftHttpServlet: !!!!!!!!! Suspect???????? ---> file:/home/jenkins/workspace/SparkPullRequestBuilder/assembly/target/scala-2.12/jars/jsr311-api-1.1.1.jar ``` The log showed that the assembly phase copied it to `assembly/target/scala-2.12/jars/`, which is added to the classpath too. With the help of the SBT `dependencyTree` tool, I saw `jsr311-api` again as a transitive dependency of `jersey-core` from the `yarn` module with `test` scope. So this seems to be another SBT-side bug, in the `sbt-assembly` plugin: it copied a test-scope transitive artifact to the assembly output. 
In this PR, I defined some rules in SparkBuild.scala to work around the potential bugs on the SBT side. First, exclude `jsr311` from the whole project and then add it back separately to the YARN module for SBT. Additionally, the HiveThriftServerSuites were refactored to reduce flakiness too, but this is not related to the bugs I have found so far. ### Why are the changes needed? To fix the tests listed above. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Passing Jenkins and GitHub Actions. Closes #30643 from yaooqinn/HiveThriftHttpServerSuite. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 05:14:38 +00:00
Yuming Wang	01b73ae638	[SPARK-33766][BUILD] Upgrade Jackson to 2.11.4 ### What changes were proposed in this pull request? This PR upgrades Jackson to 2.11.4. Jackson Release 2.11: https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.11 ### Why are the changes needed? To make dependency upgrades easier, because Jackson 2.10 is not compatible with 2.11: ``` com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.5 requires Jackson Databind version >= 2.10.0 and < 2.11.0 ``` [Avro](https://issues.apache.org/jira/browse/AVRO-2967) has upgraded Jackson to 2.11.3. [Parquet](https://issues.apache.org/jira/browse/PARQUET-1895) has upgraded Jackson to 2.11.2. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30746 from wangyum/SPARK-33766. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-13 14:40:55 -08:00
Nicholas Marion	99848e530f	[SPARK-33762][BUILD] Upgrade commons-codec to 1.15 ### What changes were proposed in this pull request? ### Why are the changes needed? Open Source scans are reporting a potential encoding/decoding issue related to versions of commons-codec prior to 1.13. Commit referenced: `48b615756d` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30740 from n-marion/SPARK-33762_upgrade-commons-codec. Authored-by: Nicholas Marion <nmarion@us.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-13 14:36:54 -08:00
HyukjinKwon	cd7a30641f	[SPARK-33749][BUILD][PYTHON] Exclude target directory in pycodestyle and flake8 ### What changes were proposed in this pull request? Once you have built and run the K8S tests, the Python lint fails as below: ```bash $ ./dev/lint-python ``` Before this PR: ``` starting python compilation test... python compilation succeeded. downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py... starting pycodestyle test... pycodestyle checks failed: ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/pyspark/cloudpickle/cloudpickle.py:15:101: E501 line too long (105 > 100 characters) ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/docs/source/conf.py:60:101: E501 line too long (124 > 100 characters) ... ``` After this PR: ``` starting python compilation test... python compilation succeeded. downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py... starting pycodestyle test... pycodestyle checks passed. starting flake8 test... flake8 checks passed. starting mypy test... mypy checks passed. starting sphinx-build tests... sphinx-build checks passed. ``` This PR excludes the target directory to avoid such cases in the future. ### Why are the changes needed? To make it easier to run linters. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via running `./dev/lint-python`. Closes #30718 from HyukjinKwon/SPARK-33749. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-11 14:15:56 +09:00
Dongjoon Hyun	1ba1732beb	[SPARK-33295][BUILD] Upgrade ORC to 1.6.6 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.6.6 for Apache Spark 3.2.0. ### Why are the changes needed? This brings the latest bug fixes and features. Apache Iceberg is already using Apache ORC 1.6.6. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30715 from dongjoon-hyun/SPARK-33295. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-10 19:15:01 -08:00
Liang-Chi Hsieh	667f64f447	[SPARK-33725][BUILD] Upgrade snappy-java to 1.1.8.2 ### What changes were proposed in this pull request? This upgrades snappy-java to 1.1.8.2. ### Why are the changes needed? Minor version upgrade that includes: - [Fixed](https://github.com/xerial/snappy-java/pull/265) an initialization issue when using a recent Mac OS X version - Support Apple Silicon (M1, Mac-aarch64) - Fixed the pure-java Snappy fallback logic when no native library for your platform is found. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30690 from viirya/upgrade-snappy. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-09 14:26:53 -08:00
Nicholas Marion	3ac70f169d	[SPARK-33695][BUILD] Upgrade to jackson to 2.10.5 and jackson-databind to 2.10.5.1 ### What changes were proposed in this pull request? Upgrade the jackson dependencies to 2.10.5 and jackson-databind to 2.10.5.1 ### Why are the changes needed? Jackson dependency has vulnerability CVE-2020-25649. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #30656 from n-marion/SPARK-33695_upgrade-jackson. Authored-by: Nicholas Marion <nmarion@us.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-08 12:11:06 -08:00
Fokko Driesprong	e4d1c10760	[SPARK-32320][PYSPARK] Remove mutable default arguments This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ ``` fokkodriesprongFan spark % grep -R "={}" python \| grep def python/pyspark/resource/profile.py: def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}): python/pyspark/sql/functions.py:def from_json(col, schema, options={}): python/pyspark/sql/functions.py:def to_json(col, options={}): python/pyspark/sql/functions.py:def schema_of_json(json, options={}): python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}): python/pyspark/sql/functions.py:def to_csv(col, options={}): python/pyspark/sql/functions.py:def from_csv(col, schema, options={}): python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}): ``` ``` fokkodriesprongFan spark % grep -R "=\[\]" python \| grep def python/pyspark/ml/tuning.py: def __init__(self, bestModel, avgMetrics=[], subModels=None): python/pyspark/ml/tuning.py: def __init__(self, bestModel, validationMetrics=[], subModels=None): ``` ### What changes were proposed in this pull request? Removing the mutable default arguments. ### Why are the changes needed? Removing the mutable default arguments, and changing the signature to `Optional[...]`. ### Does this PR introduce _any_ user-facing change? No 👍 ### How was this patch tested? Using the Flake8 bugbear code analysis plugin. Closes #29122 from Fokko/SPARK-32320. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 09:35:36 +08:00
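The pitfall that SPARK-32320 removes can be shown with a minimal sketch. The functions `append_bad` and `append_good` are hypothetical and are not taken from PySpark; they only contrast a mutable default with the `Optional[...]` plus `None` pattern the commit message describes.

```python
from typing import List, Optional


def append_bad(item: int, acc: List[int] = []) -> List[int]:
    # The default list is created once, when the function is defined,
    # so every call that omits `acc` shares the same list object.
    acc.append(item)
    return acc


def append_good(item: int, acc: Optional[List[int]] = None) -> List[int]:
    # With None as the default, a fresh list is created on each call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc


# Copy the results so the snapshots are not affected by later calls.
first_bad = list(append_bad(1))
second_bad = list(append_bad(2))  # state leaked from the first call

first_good = append_good(1)
second_good = append_good(2)

print(first_bad, second_bad)
print(first_good, second_good)
```

The same reasoning applies to the `options={}` and `avgMetrics=[]` signatures listed in the grep output above.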
Kousuke Saruta	d48ef34911	[SPARK-33684][BUILD] Upgrade httpclient from 4.5.6 to 4.5.13 ### What changes were proposed in this pull request? This PR upgrades `commons.httpclient` from `4.5.6` to `4.5.13`. 4.5.6 was released over 2 years ago, and we can now use the more stable `4.5.13`. https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt ### Why are the changes needed? To follow the more stable release. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Should be done by the existing tests. Closes #30634 from sarutak/upgrade-httpclient. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 23:02:36 -08:00
uncleGen	4f96670358	[SPARK-31953][SS] Add Spark Structured Streaming History Server Support ### What changes were proposed in this pull request? Add Spark Structured Streaming History Server Support. ### Why are the changes needed? Add a streaming query history server plugin. ![image](https://user-images.githubusercontent.com/7402327/84248291-d26cfe80-ab3b-11ea-86d2-98205fa2bcc4.png) ![image](https://user-images.githubusercontent.com/7402327/84248347-e44ea180-ab3b-11ea-81de-eefe207656f2.png) ![image](https://user-images.githubusercontent.com/7402327/84248396-f0d2fa00-ab3b-11ea-9b0d-e410115471b0.png) - Follow-ups - Query duration should not update in history UI. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Update UT. Closes #28781 from uncleGen/SPARK-31953. Lead-authored-by: uncleGen <hustyugm@gmail.com> Co-authored-by: Genmao Yu <hustyugm@gmail.com> Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-12-02 17:11:51 -08:00
Dongjoon Hyun	290aa02179	[SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work ### What changes were proposed in this pull request? This reverts commit SPARK-33212 (`cb3fa6c936`) mostly with three exceptions: 1. `SparkSubmitUtils` was updated recently by SPARK-33580 2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency. 3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471. ### Why are the changes needed? According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following. 1. Spark distribution with `-Phadoop-cloud` ```scala $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[], app id = local-1606806088715). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) Type in expressions to have them evaluated. Type :help for more information. 
scala> spark.read.parquet("s3a://dongjoon/users.parquet").show 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties +------+--------------+----------------+ \| name\|favorite_color\|favorite_numbers\| +------+--------------+----------------+ \|Alyssa\| null\| [3, 9, 15, 20]\| \| Ben\| red\| []\| +------+--------------+----------------+ scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet") 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1] java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V ``` 2. Spark distribution without `-Phadoop-cloud`* ```scala $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0 ... java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI. Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-02 18:23:48 +09:00
Weichen Xu	80161238fe	[SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading ### What changes were proposed in this pull request? Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent. Two typical cases to manually test: ~~~python tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) # check `loadedTvs.getEstimatorParamMaps()` restored correctly. ~~~ ~~~python lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) # check `loadedTvs.getEstimatorParamMaps()` restored correctly. ~~~ ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30539 from WeichenXu123/fix_tuning_param_maps_io. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-01 09:36:42 +08:00
Josh Soref	13fd272cd3	Spelling r common dev mlib external project streaming resource managers python ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `R` * `common` * `dev` * `mlib` * `external` * `project` * `streaming` * `resource-managers` * `python` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-27 10:22:45 -06:00
HyukjinKwon	ed9e6fc182	[SPARK-33565][INFRA][FOLLOW-UP] Keep the test coverage with Python 3.8 in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue. Before this change in GitHub Actions: ``` ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.6', 'pypy3'] ... ``` After this change in GitHub Actions: ``` ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.6', 'python3.8', 'pypy3'] ``` ### Why are the changes needed? To keep the test coverage with Python 3.8 in GitHub Actions. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions in this build will test. Closes #30510 from HyukjinKwon/SPARK-33565. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-26 11:42:12 +09:00
Yuming Wang	1de3fc4282	[SPARK-33525][SQL] Update hive-service-rpc to 3.1.2 ### What changes were proposed in this pull request? The supported Hive metastore versions are 0.12.0 through 3.1.2, but the supported hive-jdbc versions are only 0.12.0 through 2.3.7. Using hive-jdbc 3.x throws `TProtocolException`: ``` [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable. Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) ``` ``` org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client? at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) ``` This PR upgrades hive-service-rpc to 3.1.2 to fix this issue. ### Why are the changes needed? To support hive-jdbc 3.x. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
Manual test: ``` [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; +---------+ \| Result \| +---------+ +---------+ No rows selected (1.051 seconds) 0: jdbc:hive2://localhost:10000/default> insert into t1 values(1); +---------+ \| Result \| +---------+ +---------+ No rows selected (2.08 seconds) 0: jdbc:hive2://localhost:10000/default> select * from t1; +-----+ \| id \| +-----+ \| 1 \| +-----+ 1 row selected (0.605 seconds) ``` Closes #30478 from wangyum/SPARK-33525. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 12:37:59 -08:00
yangjie01	048a9821c7	[SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script ### What changes were proposed in this pull request? Jenkins test jobs for many PRs currently have failing tests. The failed cases include: - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 get binary type` - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 get binary type` - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 get binary type` - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 get binary type` - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 get binary type` The error message is as follows: ``` Error Message org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�](" Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�](" at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302) ``` However, these tests pass in GitHub Actions, so the failures may be related to the `LANG` setting of the Jenkins build machine. This PR adds `export LANG="en_US.UTF-8"` to the `run-tests-jenkins` script. ### Why are the changes needed? To ensure that `LANG` in the Jenkins test process is `en_US.UTF-8` so that the `HIVE_CLI_SERVICE_PROTOCOL_VX` related tests pass. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
Jenkins tests pass Closes #30487 from LuciferYang/SPARK-33535. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-24 09:50:10 -08:00
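One way a `?` versus `�` mismatch like the one in SPARK-33535 can arise is sketched below in Python. This is an illustration under assumptions, not the JVM code path the suite actually exercises: the hypothetical helper `render` round-trips a string through a given charset, and a charset that cannot represent U+FFFD degrades it to `?`.

```python
REPLACEMENT = "\ufffd"  # the replacement character the test expects


def render(text: str, charset: str) -> str:
    # Encode with the given charset, substituting "?" for characters
    # the charset cannot represent, then decode back to a string.
    return text.encode(charset, errors="replace").decode(charset)


utf8_result = render(REPLACEMENT + "(", "utf-8")
ascii_result = render(REPLACEMENT + "(", "ascii")

print(repr(utf8_result))
print(repr(ascii_result))
```

With a UTF-8 charset the character survives unchanged, which is consistent with the commit's choice to pin `LANG` to `en_US.UTF-8`.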
William Hyun	84e70362db	[SPARK-33510][BUILD] Update SBT to 1.4.4 ### What changes were proposed in this pull request? This PR aims to update SBT from 1.4.2 to 1.4.4. ### Why are the changes needed? This will bring the latest bug fixes. - https://github.com/sbt/sbt/releases/tag/v1.4.3 - https://github.com/sbt/sbt/releases/tag/v1.4.4 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30453 from williamhyun/sbt143. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-22 22:56:59 -08:00
William Hyun	a459238523	[MINOR][INFRA] Suppress warning in check-license ### What changes were proposed in this pull request? This PR aims to suppress the warning `File exists` in check-license ### Why are the changes needed? BEFORE ``` % dev/check-license Attempting to fetch rat RAT checks passed. % dev/check-license mkdir: target: File exists RAT checks passed. ``` AFTER ``` % dev/check-license Attempting to fetch rat RAT checks passed. % dev/check-license RAT checks passed. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually do dev/check-license twice. Closes #30460 from williamhyun/checklicense. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-23 10:38:40 +09:00
Dongjoon Hyun	d5e7bd0cc4	[SPARK-33483][INFRA][TESTS] Fix rat exclusion patterns and add a LICENSE ### What changes were proposed in this pull request? This PR fixes the RAT exclusion rule, which originated in SPARK-1144 (Apache Spark 1.0). ### Why are the changes needed? This prevents situations like https://github.com/apache/spark/pull/30415. Currently, the check misses the `catalog` directory due to the `.log` rule. ``` $ dev/check-license Could not find Apache license headers in the following files: !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI with the new rule. Closes #30418 from dongjoon-hyun/SPARK-RAT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 23:59:11 -08:00
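Why a `.log` rule can swallow a `catalog` directory is easy to see if the rule is treated as an unanchored regular expression, which is an assumption made here for illustration. In the sketch below, the pattern `\.log$` is a hypothetical tightened rule, not necessarily the one this commit introduced.

```python
import re

# Path taken from the check-license output in the commit message,
# with the machine-specific prefix removed.
path = (
    "sql/catalyst/src/main/java/org/apache/spark/sql/"
    "connector/catalog/MetadataColumn.java"
)

loose = re.compile(r".log")      # "." matches ANY character, e.g. the "a" in "catalog"
strict = re.compile(r"\.log$")   # hypothetical: literal dot, anchored at the end

loose_hit = loose.search(path) is not None
strict_hit = strict.search(path) is not None
strict_hit_on_log = strict.search("python/unit-tests.log") is not None

print(loose_hit, strict_hit, strict_hit_on_log)
```

The loose pattern matches the substring `alog` inside `catalog`, so files under that directory would be excluded from the license check.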
Takeshi Yamamuro	74bd046d17	[SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1 ### What changes were proposed in this pull request? This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1. ### Why are the changes needed? Release note of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug fixes for java targets): - v4.8: https://github.com/antlr/antlr4/releases/tag/4.8 - v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA tests. Closes #30404 from maropu/UpgradeAntlr. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:20:28 +09:00
Rameshkrishnan Muthusamy	5e8549973d	[SPARK-33471][K8S][BUILD] Upgrade kubernetes-client to 4.12.0 ### What changes were proposed in this pull request? This PR aims to upgrade Kubernetes-client from 4.11.1 to 4.12.0 ### Why are the changes needed? This upgrades the dependency for Apache Spark 3.1.0. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30401 from ramesh-muthusamy/SPARK-33471-k8s-clientupgrade. Authored-by: Rameshkrishnan Muthusamy <rameshkrishnan_muthusamy@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-17 13:41:58 -08:00
Chao Sun	c2caf2522b	[SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0 ### What changes were proposed in this pull request? This upgrades the Apache Arrow version from 1.0.1 to 2.0.0. ### Why are the changes needed? Apache Arrow 2.0.0 was released with some improvements on the Java side, so it's better to upgrade Spark to the new version. Note that the format version in Arrow 2.0.0 is still 1.0.0, so the API should still be compatible between 1.0.1 and 2.0.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #30306 from sunchao/SPARK-33213. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-09 19:07:16 -08:00
Dongjoon Hyun	35ac314181	[SPARK-33405][BUILD] Upgrade commons-compress to 1.20 ### What changes were proposed in this pull request? This PR aims to upgrade `commons-compress` from 1.8 to 1.20. ### Why are the changes needed? - https://commons.apache.org/proper/commons-compress/security-reports.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30304 from dongjoon-hyun/SPARK-33405. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-10 11:08:55 +09:00
huangtianhua	83a80796aa	[SPARK-32691][BUILD] Update commons-crypto to v1.1.0 ### What changes were proposed in this pull request? Update the package commons-crypto to v1.1.0 to support the aarch64 platform - https://issues.apache.org/jira/browse/CRYPTO-139 ### Why are the changes needed? The package commons-crypto-1.0.0 available in the Maven repository doesn't support the aarch64 platform. `CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv)` takes a long time when NettyBlockRpcServer receives block data from a client. If it takes longer than the default value of 120s, an IOException is raised and the client retries replicating the block data to other executors. In fact the replication has already completed, so the replication count becomes incorrect. This update makes the DistributedSuite tests pass. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the CIs. Closes #30275 from huangtianhua/SPARK-32691. Authored-by: huangtianhua <huangtianhua223@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-09 14:33:27 -08:00

975 commits