### What changes were proposed in this pull request?
This PR aims to upgrade `kubernetes-client` library to bring the JDK8 related fixes. Please note that JDK11 works fine without any problem.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.9.2
- JDK8 always uses http/1.1 protocol (Prevent OkHttp from wrongly enabling http/2)
### Why are the changes needed?
OkHttp "wrongly" detects the Platform as Jdk9Platform on JDK 8u251.
- https://github.com/fabric8io/kubernetes-client/issues/2212
- https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster
Although there is a workaround `export HTTP2_DISABLE=true` and `Downgrade JDK or K8s`, we had better avoid this problematic situation.
### Does this PR introduce _any_ user-facing change?
No. This will recover the failures on JDK 8u252.
### How was this patch tested?
- [x] Pass the Jenkins UT (https://github.com/apache/spark/pull/28601#issuecomment-632474270)
- [x] Pass the Jenkins K8S IT with the K8s 1.13 (https://github.com/apache/spark/pull/28601#issuecomment-632438452)
- [x] Manual testing with K8s 1.17.3. (Below)
**v1.17.6 result (on Minikube)**
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 27 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#28601 from dongjoon-hyun/SPARK-K8S-CLIENT.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR mainly adds two things.
1. Real headless browser support for UI test
2. A test suite using headless Chrome as one instance of those browsers.
Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite.
### Why are the changes needed?
In the current master, there are two problems for UI test.
1. Lots of tests especially JavaScript related ones are done manually.
Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally.
2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough.
I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test.
The test I added works somehow but some JavaScript related error is shown in unit-tests.log.
```
======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807)
at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216)
at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52)
at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102)
at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426)
at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157)
at java.lang.Thread.run(Thread.java:748)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889)
... 10 more
JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)".
== CALLING JAVASCRIPT ==
function () {
throw e;
}
======= EXCEPTION END ========
```
I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error.
```
[info] UISeleniumSuite:
[info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds)
[info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1)
```
To resolve those problems, it's better to support headless browser for UI test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I tested with following patterns. Both Chrome and Chrome driver should be installed to test.
1. sbt / with chromedriver / include tag (expect to succeed)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
2. sbt / with chromedriver / exclude tag (expect to be ignored)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -l org.apache.spark.tags.ChromeUITest"`
3. sbt / without chromedriver / include tag (expect to be failed)
`build/sbt "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
4. sbt / without chromedriver / exclude tag (expect to be skipped)
`build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"`
5. Maven / wth chromedriver / include tag (expect to succeed)
`build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
6. Maven / with chromedriver / exclude tag (expect to be skipped)
`build/mvn -Dtest.exclude.tags="org.apache.spark.tags.ChromeUITest" -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
7. Maven / without chromedriver / include tag (expect to be failed)
`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
8. Maven / without chromedriver / exclude tag (expect to be skipped)
`build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test`
Closes#28578 from sarutak/real-headless-browser-support.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This adds UI updates to support stage level scheduling and ResourceProfiles. 3 main things have been added. ResourceProfile id added to the Stage page, the Executors page now has an optional selectable column to show the ResourceProfile Id of each executor, and the Environment page now has a section with the ResourceProfile ids. Along with this the rest api for environment page was updated to include the Resource profile information.
I debating on splitting the resource profile information into its own page but I wasn't sure it called for a completely separate page. Open to peoples thoughts on this.
Screen shots:
![Screen Shot 2020-04-01 at 3 07 46 PM](https://user-images.githubusercontent.com/4563792/78185169-469a7000-7430-11ea-8b0c-d9ede2d41df8.png)
![Screen Shot 2020-04-01 at 3 08 14 PM](https://user-images.githubusercontent.com/4563792/78185175-48fcca00-7430-11ea-8d1d-6b9333700f32.png)
![Screen Shot 2020-04-01 at 3 09 03 PM](https://user-images.githubusercontent.com/4563792/78185176-4a2df700-7430-11ea-92d9-73c382bb0f32.png)
![Screen Shot 2020-04-01 at 11 05 48 AM](https://user-images.githubusercontent.com/4563792/78185186-4dc17e00-7430-11ea-8962-f749dd47ea60.png)
### Why are the changes needed?
For user to be able to know what resource profile was used with which stage and executors. The resource profile information is also available so user debugging can see exactly what resources were requested with that profile.
### Does this PR introduce any user-facing change?
Yes, UI updates.
### How was this patch tested?
Unit tests and tested on yarn both active applications and with the history server.
Closes#28094 from tgravescs/SPARK-29303-pr.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
After changes in SPARK-20628, CoarseGrainedSchedulerBackend can decommission an executor and stop assigning new tasks on it. We should also decommission the corresponding blockmanagers in the same way. i.e. Move the cached RDD blocks from those executors to other active executors.
### Why are the changes needed?
We need to gracefully decommission the block managers so that the underlying RDD cache blocks are not lost in case the executors are taken away forcefully after some timeout (because of spotloss/pre-emptible VM etc). Its good to save as much cache data as possible.
Also In future once the decommissioning signal comes from Cluster Manager (say YARN/Mesos etc), dynamic allocation + this change gives us opportunity to downscale the executors faster by making the executors free of cache data.
Note that this is a best effort approach. We try to move cache blocks from decommissioning executors to active executors. If the active executors don't have free resources available on them for caching, then the decommissioning executors will keep the cache block which it was not able to move and it will still be able to serve them.
Current overall Flow:
1. CoarseGrainedSchedulerBackend receives a signal to decommissionExecutor. On receiving the signal, it do 2 things - Stop assigning new tasks (SPARK-20628), Send another message to BlockManagerMasterEndpoint (via BlockManagerMaster) to decommission the corresponding BlockManager.
2. BlockManagerMasterEndpoint receives "DecommissionBlockManagers" message. On receiving this, it moves the corresponding block managers to "decommissioning" state. All decommissioning BMs are excluded from the getPeers RPC call which is used for replication. All these decommissioning BMs are also sent message from BlockManagerMasterEndpoint to start decommissioning process on themselves.
3. BlockManager on worker (say BM-x) receives the "DecommissionBlockManager" message. Now it will start BlockManagerDecommissionManager thread to offload all the RDD cached blocks. This thread can make multiple reattempts to decommission the existing cache blocks (multiple reattempts might be needed as there might not be sufficient space in other active BMs initially).
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Added UTs.
Closes#28370 from prakharjain09/SPARK-20732-rddcache-1.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This PR makes `test-dependencies.sh` detect the version string correctly by ignoring all the other lines.
### Why are the changes needed?
Currently, all SBT jobs are broken like the following.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/476/console
```
[error] running /home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/dev/test-dependencies.sh ; received return code 1
Build step 'Execute shell' marked build as failure
```
The reason is that the script detects the old version like `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` when `build/mvn` did fallback.
Specifically, in the script, `OLD_VERSION` became `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` instead of `3.1.0-SNAPSHOT` if build/mvn did fallback. Then, `pom.xml` file is corrupted like the following at the end and the exit code become `1` instead of `0`. It causes Jenkins jobs fails
```
- <version>3.1.0-SNAPSHOT</version>
+ <version>Falling</version>
```
**NO FALLBACK**
```
$ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
3.1.0-SNAPSHOT
```
**FALLBACK**
```
$ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec
Falling back to archive.apache.org to download Maven
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
3.1.0-SNAPSHOT
```
**In the script**
```
$ echo $(build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec)
Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn
Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT
```
This PR will prevent irrelevant logs like `Falling back to archive.apache.org to download Maven`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the PR Builder.
Closes#28532 from dongjoon-hyun/SPARK-31713.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR removes `bootstrap-tooltip.js` which is no longer used.
That script is replaced with `bootstrap.bundle.min.js` in SPARK-30654 ( #27370 ).
### Why are the changes needed?
For cleaning up repository..
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked whether tooltips are shown in the UI and no error message shown in the debug console.
Closes#28515 from sarutak/remove-tooltipjs.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds `i` option to ignore additional `build/mvn` output which is irrelevant to version string.
### Why are the changes needed?
SPARK-28963 added additional output message, `Falling back to archive.apache.org to download Maven` in build/mvn. This breaks `dev/create-release/release-build.sh` and currently Spark Packaging Jenkins job is hitting this issue consistently and broken.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This happens only when the mirror fails. So, this is verified manually hiject the script. It works like the following.
```
$ echo 'Falling back to archive.apache.org to download Maven' > out
$ build/mvn help:evaluate -Dexpression=project.version >> out
Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
$ cat out | grep -v INFO | grep -v WARNING | grep -v Download
Falling back to archive.apache.org to download Maven
3.1.0-SNAPSHOT
$ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
3.1.0-SNAPSHOT
```
Closes#28514 from dongjoon-hyun/SPARK_RELEASE_2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
snappy-java have release v1.1.7.5, upgrade to latest version.
Fixed in v1.1.7.4
- Caching internal buffers for SnappyFramed streams #234
- Fixed the native lib for ppc64le to work with glibc 2.17 (Previously it depended on 2.22)
Fixed in v1.1.7.5
- Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4
https://github.com/xerial/snappy-java/compare/1.1.7.3...1.1.7.5
v 1.1.7.5 release note:
edc4ec28bd
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need
Closes#28472 from AngersZhuuuu/spark-31655.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SLF4J from 1.7.16 to 1.7.30.
### Why are the changes needed?
SLF4J 1.7.23+ is required to enable `slf4j-log4j12` with MDC feature to run under Java 9. Also, this will bring all latest bug fixes.
- http://www.slf4j.org/news.html
> When running under Java 9, log4j version 1.2.x is unable to correctly parse the "java.version" system property. Assuming an inccorect Java version, it proceeded to disable its MDC functionality. The slf4j-log4j12 module shipping in this release fixes the issue by tweaking MDC internals by reflection, allowing log4j to run under Java 9. See also SLF4J-393.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#28446 from dongjoon-hyun/SPARK-31633.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Some tidying of `sh` scripts in `R/`
### Why are the changes needed?
Not strictly needed, but the `'devtools' %in% installed.packages()` line in particular is "improper" / proabbly slow
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not
Closes#28419 from MichaelChirico/r-scripts-cleanup.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This aims to upgrade Rtools first to prepare R 4.0.0 in AppVeyor for Apache Spark 3.1.0.
### Why are the changes needed?
R 4.0.0 is released on April 24th, 2020. It uses Rtools 4.0.0 officially.
- https://cran.r-project.org/doc/manuals/r-release/NEWS.html
- https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
### Does this PR introduce any user-facing change?
No. (This PR aims to test Rtools 4.0.0 in AppVeyor environment.)
### How was this patch tested?
See the AppVeyor result.
Closes#28358 from dongjoon-hyun/SPARK-31567.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As part of the Stage level scheduling features, add the Python api's to set resource profiles.
This also adds the functionality to properly apply the pyspark memory configuration when specified in the ResourceProfile. The pyspark memory configuration is being passed in the task local properties. This was an easy way to get it to the PythonRunner that needs it. I modeled this off how the barrier task scheduling is passing the addresses. As part of this I added in the JavaRDD api's because those are needed by python.
### Why are the changes needed?
python api for this feature
### Does this PR introduce any user-facing change?
Yes adds the java and python apis for user to specify a ResourceProfile to use stage level scheduling.
### How was this patch tested?
unit tests and manually tested on yarn. Tests also run to verify it errors properly on standalone and local mode where its not yet supported.
Closes#28085 from tgravescs/SPARK-29641-pr-base.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
**Hive 2.3.7** fixed these issues:
- HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer
- HIVE-21980:Parsing time can be high in case of deeply nested subqueries
- HIVE-22249: Support Parquet through HCatalog
### Why are the changes needed?
Fix CCE during creating HiveMetaStoreClient in JDK11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
- [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840)
- [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353)
- [x] Manual test with remote hive metastore.
Hive side:
```
export JAVA_HOME=/usr/lib/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH
cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6
bin/schematool -dbType derby -initSchema --verbose
bin/hive --service metastore
```
Spark side:
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```
Closes#28148 from wangyum/SPARK-31381.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR (SPARK-31462) fixes the usage of getopts and case statement in `do-release.sh` and `do-release-docker.sh`.
### Why are the changes needed?
In the current master, do-release.sh contains the following code.
```
while getopts "bn" opt; do
case $opt in
b) GIT_BRANCH=$OPTARG ;;
n) DRY_RUN=1 ;;
?) error "Invalid option: $OPTARG" ;;
esac
done
```
There are 3 wrong usage in getopts and case statement.
1. To set $OPTARG to an argument passed for the option "b", the parameter for getopts should be "b:".
2. To set $OPTARG to the invalid option name passed, the parameter for getopts starts with ":".
3. It's minor but to match the character "?", it's better to escape like "\\?".
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
I checked that $GIT_BRANCH is set when do-release.sh is launched with -b option.
I also checked that the error message contains invalid option name when do-release.sh is launched with an invalid option.
Closes#28234 from sarutak/fix-do-release.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Upgrade vis.js to fix an infinite re-drawing issue.
As reported here, old releases of vis.js have that issue.
Fortunately, the latest version seems to resolve the issue.
With the latest release of vis.js, there are some performance issues with the original `timeline-view.js` and `timeline-view.css` so I also changed them.
### Why are the changes needed?
For better UX.
### Does this PR introduce any user-facing change?
No. Appearance and functionalities are not changed.
### How was this patch tested?
I confirmed infinite redrawing doesn't happen with a JobPage which I had reproduced the issue.
With the original version of vis.js, I reproduced the issue with the following conditions.
* Use history server and load core/src/test/resources/spark-events.
* Visit the JobPage for job2 in application_1553914137147_0018.
* Zoom out to 80% on Safari / Chrome / Firefox.
Maybe, it depends on OS and the version of browsers.
Closes#28192 from sarutak/upgrade-visjs.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR unsets `setuptools` version in CI. This was fixed in the 0.46.1.2+ `setuptools` - pypa/setuptools#2046. `setuptools` 0.46.1.0 and 0.46.1.1 still have this problem.
### Why are the changes needed?
To test the latest setuptools out to see if users can actually install and use it.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Jenkins will test.
Closes#28111 from HyukjinKwon/SPARK-31231-revert.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The release script stops working after d5865493ae , as we require `mkdocs 1.0.0`.
This PR upgrades `mkdocs` from 0.1.6.3 to 1.0.4. To do that ruby is also upgraded to 2.5.
This PR also fixes some small issues.
### Why are the changes needed?
to make RC
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
tested by 3.0.0-rc1
Closes#28088 from cloud-fan/rc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR is a followup of apache/spark#27995. Rather then pining setuptools version, it sets upper bound so Python 3.5 with branch-2.4 tests can pass too.
## Why are the changes needed?
To make the CI build stable
## Does this PR introduce any user-facing change?
No, dev-only change.
## How was this patch tested?
Jenkins will test.
Closes#28005 from HyukjinKwon/investigate-pip-packaging-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For a bit of background,
PIP packaging test started to fail (see [this logs](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)) as of setuptools 46.1.0 release. In https://github.com/pypa/setuptools/issues/1424, they decided to don't keep the modes in `package_data`.
In PySpark pip installation, we keep the executable scripts in `package_data` fc4e56a54c/python/setup.py (L199-L200), and expose their symbolic links as executable scripts.
So, the symbolic links (or copied scripts) executes the scripts copied from `package_data`, which doesn't have the executable permission in its mode:
```
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: Permission denied
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied
```
The current issue is being tracked at https://github.com/pypa/setuptools/issues/2041
</br>
For what this PR proposes:
It sets the upper bound in PR builder for now to unblock other PRs. _This PR does not solve the issue yet. I will make a fix after monitoring https://github.com/pypa/setuptools/issues/2041_
### Why are the changes needed?
It currently affects users who uses the latest setuptools. So, _users seem unable to use PySpark with the latest setuptools._ See also https://github.com/pypa/setuptools/issues/2041#issuecomment-602566667
### Does this PR introduce any user-facing change?
It makes CI pass for now. No user-facing change yet.
### How was this patch tested?
Jenkins will test.
Closes#27995 from HyukjinKwon/investigate-pip-packaging.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR(SPARK-31101) proposes to upgrade Janino to 3.0.16 which is released recently.
* Merged pull request janino-compiler/janino#114 "Grow the code for relocatables, and do fixup, and relocate".
Please see the commit log.
- https://github.com/janino-compiler/janino/commits/3.0.16
You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html / though release note for Janino 3.0.16 is actually incorrect.
### Why are the changes needed?
We got some report on failure on user's query which Janino throws error on compiling generated code. The issue is here: janino-compiler/janino#113 It contains the information of generated code, symptom (error), and analysis of the bug, so please refer the link for more details.
Janino 3.0.16 contains the PR janino-compiler/janino#114 which would enable Janino to succeed to compile user's query properly.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#27932 from HeartSaVioR/SPARK-31101-janino-3.0.16.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As discovered here https://github.com/apache/spark/pull/27910#issuecomment-599027190, pydocstyle tests were not running anywhere (not on Jenkins; not on GitHub).
~This PR enables those tests.~
It also seems like a [large hill to climb](https://github.com/apache/spark/pull/27912#issuecomment-599167117) to enable any meaningful checks, so we're going to just rip pydocstyle out for now.
### Why are the changes needed?
Presumably, we defined those doc style tests because we care about whatever it is they enforce. Since we're not actually testing anything, though, it's better to clear the cruft.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Will check the GitHub workflow logs on this PR.
Closes#27912 from nchammas/SPARK-31155-pydocstyle.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR cleans up several failures -- most of them silent -- in `dev/lint-python`. I don't understand how we haven't been bitten by these yet. Perhaps we've been lucky?
Fixes include:
* Fix how we compare versions. All the version checks currently in `master` silently fail with:
```
File "<string>", line 2
print(LooseVersion("""2.3.1""") >= LooseVersion("""2.4.0"""))
^
IndentationError: unexpected indent
```
Another problem is that `distutils.version` is undocumented and unsupported.
* Fix some basic bugs. e.g. We have an incorrect reference to `$PYDOCSTYLEBUILD`, which doesn't exist, which was causing the doc style test to silently fail with:
```
./dev/lint-python: line 193: --version: command not found
```
* Stop suppressing error output! It's hiding problems and serves no purpose here.
### Why are the changes needed?
`lint-python` is part of our CI build and is currently doing any combination of the following: silently failing; incorrectly skipping tests; incorrectly downloading libraries when a suitable library is already available.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Lots of manual testing with `set -x` enabled.
Closes#27910 from nchammas/SPARK-31153-lint-python.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark's Web UI is using an older version of Bootstrap (v. 2.3.2) for the portal pages. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs:
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700https://snyk.io/vuln/npm:bootstrap:20180529https://snyk.io/vuln/npm:bootstrap:20160627
I haven't validated each CVE, but it would be nice to resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the UI for functionality and appearance. This is a fairly large change so I'm sure additional testing and fixes will be needed.
### How was this patch tested?
This has been manually tested, but there is a ton of functionality and there are many pages and detail pages so it is very possible bugs introduced from the upgrade were missed. Additional testing and feedback is welcomed. If it appears a whole page was missed let me know and I'll take a pass at addressing that page/section.
Closes#27370 from clarkead/bootstrap4-core-upgrade.
Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR makes `dev/make-distribution.sh` a bit easier to use by not hiding errors thrown by Maven.
As a supporting change, this PR also suppresses progress bar output from curl and wget that may be output from within `build/mvn`.
### Why are the changes needed?
It's surprising for command-line options to be position-dependent. The errors that get thrown when you pass the correct option (like `--pip`) but in the wrong order are confusing.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
I ran a few invocations of `make-distribution.sh` to confirm that, when passed incorrect options, I can actually see the errors thrown directly by Maven.
Closes#27800 from nchammas/SPARK-31041-make-distribution.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR makes the following refinements to the workflow for building docs:
* Install Python and Ruby consistently using pyenv and rbenv across both the docs README and the release Dockerfile.
* Pin the Python and Ruby versions we use.
* Pin all direct Python and Ruby dependency versions.
* Eliminate any use of `sudo pip`, which the Python community discourages, or `sudo gem`.
### Why are the changes needed?
This PR should increase the consistency and reproducibility of the doc-building process by managing Python and Ruby in a more consistent way, and by eliminating unused or outdated code.
Here's a possible example of an issue building the docs that would be addressed by the changes in this PR: https://github.com/apache/spark/pull/27459#discussion_r376135719
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual tests:
* I was able to build the Docker image successfully, minus the final part about `RUN useradd`.
* I am unable to run `do-release-docker.sh` because I am not a committer and don't have the required GPG key.
* I built the docs locally and viewed them in the browser.
I think I need a committer to more fully test out these changes.
Closes#27534 from nchammas/SPARK-30731-building-docs.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR propose
1. Explicitly include xml-apis. xml-apis is already the part of xerces 2.12.0 (https://repo1.maven.org/maven2/xerces/xercesImpl/2.12.0/xercesImpl-2.12.0.pom). However, we're excluding it by setting `scope` to `test`. This seems causing `spark-shell`, built from Maven, to fail.
Seems like previously xml-apis wasn't reached for some reasons but after we upgrade, it seems requiring. Therefore, this PR proposes to include it.
2. Pins `xerces` version in SBT as well. Seems this dependency is resolved differently from Maven.
Note that Hadoop 3 does not looks requiring this as they replaced xerces as of [HDFS-12221](https://issues.apache.org/jira/browse/HDFS-12221).
### Why are the changes needed?
To make `spark-shell` working from Maven build, and uses the same xerces version.
### Does this PR introduce any user-facing change?
No, it's master only.
### How was this patch tested?
**1.**
```bash
./build/mvn -DskipTests -Psparkr -Phive clean package
./bin/spark-shell
```
Before:
```
Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source)
at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source)
at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 42 more
```
After:
```
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
```
**2.**
```
./build/sbt dependencyTree -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver -Phive
./build/sbt dependencyTree -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Phive
```
Closes#27808 from HyukjinKwon/SPARK-30994.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Manage up the version of Xerces that Hadoop uses (and potentially user apps) to 2.12.0 to match https://issues.apache.org/jira/browse/HADOOP-16530
### Why are the changes needed?
Picks up bug and security fixes: https://www.xml.com/news/2018-05-apache-xerces-j-2120/
### Does this PR introduce any user-facing change?
Should be no behavior changes.
### How was this patch tested?
Existing tests.
Closes#27746 from srowen/SPARK-30994.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
turn off `x` mode and do not print the command while printing usage of `make-distribution.sh`
### Why are the changes needed?
improve dev tools
### Does this PR introduce any user-facing change?
for only developers, clearer hints
#### after
```
./dev/make-distribution.sh --hel
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/kentyao/spark
+ DISTDIR=/Users/kentyao/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/kentyao/spark/build/mvn
+ (( 1 ))
+ case $1 in
+ echo 'Error: --hel is not supported'
Error: --hel is not supported
+ exit_with_usage
+ set +x
make-distribution.sh - tool for making binary distributions of Spark
usage:
make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>
See Spark's "Building Spark" doc for correct Maven options.
```
#### before
```
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/kentyao/spark
+ DISTDIR=/Users/kentyao/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/kentyao/spark/build/mvn
+ (( 1 ))
+ case $1 in
+ echo 'Error: --hel is not supported'
Error: --hel is not supported
+ exit_with_usage
+ echo 'make-distribution.sh - tool for making binary distributions of Spark'
make-distribution.sh - tool for making binary distributions of Spark
+ echo ''
+ echo usage:
usage:
+ cl_options='[--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>]'
+ echo 'make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>'
make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>
+ echo 'See Spark'\''s "Building Spark" doc for correct Maven options.'
See Spark's "Building Spark" doc for correct Maven options.
+ echo ''
+ exit 1
```
### How was this patch tested?
manually
Closes#27706 from yaooqinn/build.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This PR aims to upgrade Py4J to `0.10.9` for better Python 3.7 support in Apache Spark 3.0.0 (master/branch-3.0). This is not for `branch-2.4`.
- Apache Spark 3.0.0 is using `Py4J 0.10.8.1` (released on 2018-10-21) because `0.10.8.1` was the first official release to support Python 3.7.
- https://www.py4j.org/changelog.html#py4j-0-10-8-and-py4j-0-10-8-1
- `Py4J 0.10.9` was released on January 25th 2020 with better Python 3.7 support and `magic_member` bug fix.
- https://github.com/bartdag/py4j/releases/tag/0.10.9
- https://www.py4j.org/changelog.html#py4j-0-10-9
No.
Pass the Jenkins with the existing tests.
Closes#27641 from dongjoon-hyun/SPARK-30884.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR targets to document the Pandas UDF redesign with type hints introduced at SPARK-28264.
Mostly self-describing; however, there are few things to note for reviewers.
1. This PR replace the existing documentation of pandas UDFs to the newer redesign to promote the Python type hints. I added some words that Spark 3.0 still keeps the compatibility though.
2. This PR proposes to name non-pandas UDFs as "Pandas Function API"
3. SCALAR_ITER become two separate sections to reduce confusion:
- `Iterator[pd.Series]` -> `Iterator[pd.Series]`
- `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`
4. I removed some examples that look overkill to me.
5. I also removed some information in the doc, that seems duplicating or too much.
### Why are the changes needed?
To document new redesign in pandas UDF.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover.
Closes#27466 from HyukjinKwon/SPARK-30722.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to make hardcoded `python3` to a variable `PYTHON_EXECUTABLE` in ' lint-python' script.
### Why are the changes needed?
To make changes easier. See 561e9b9688 as an example.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually by running `dev/lint-python`.
Closes#27470 from HyukjinKwon/minor-PYTHON_EXECUTABLE.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1
### Why are the changes needed?
New fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit and integration tests cover creation of K8S objects. Adjusted them to work with the new fabric8 version
Closes#27443 from onursatici/os/bump-fabric8.
Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5.
### Why are the changes needed?
`jekyll-redirect-from` released 0.16.0 a few days ago and that requires Ruby 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
```
$ cd dev/create-release/spark-rm/
$ docker build -t spark:test .
...
ERROR: Error installing jekyll-redirect-from:
jekyll-redirect-from requires Ruby version >= 2.4.0.
...
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually do the above command to build `spark-rm` Docker image.
```
...
Successfully installed jekyll-redirect-from-0.15.0
Parsing documentation for jekyll-redirect-from-0.15.0
Installing ri documentation for jekyll-redirect-from-0.15.0
Done installing documentation for jekyll-redirect-from after 0 seconds
1 gem installed
Successfully installed rouge-3.15.0
Parsing documentation for rouge-3.15.0
Installing ri documentation for rouge-3.15.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container e0ec7c77b69f
---> 32dec37291c6
```
Closes#27434 from dongjoon-hyun/SPARK-30704.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade to Apache ORC 1.5.9.
- For `hive-2.3` profile, we need to upgrade `hive-storage-api` from `2.6.0` to `2.7.1`.
- For `hive-1.2` profile, ORC library with classifier `nohive` already shaded it. So, there is no change.
### Why are the changes needed?
This will bring the latest bug fixes. The following is the full release note.
- https://issues.apache.org/jira/projects/ORC/versions/12346546
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Here is the summary.
1. `Hive 1.2 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580924552))
2. `Hive 2.3 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580973391))
Closes#27421 from dongjoon-hyun/SPARK-ORC-1.5.9.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to use `python3` instead of `python` in `dev/lint-python`.
### Why are the changes needed?
Currently, `dev/lint-python` fails at Python 2. And, Python 2 is EOL since January 1st 2020.
```
$ python -V
Python 2.7.17
$ dev/lint-python
starting python compilation test...
Python compilation failed with the following errors:
Compiling ./python/setup.py ...
File "./python/setup.py", line 27
file=sys.stderr)
^
SyntaxError: invalid syntax
```
### Does this PR introduce any user-facing change?
No. This is a dev environment.
### How was this patch tested?
Jenkins is running this with Python 3 already.
The following is a manual test.
```
$ python -V
Python 3.8.0
$ dev/lint-python
starting python compilation test...
python compilation succeeded.
```
Closes#27394 from dongjoon-hyun/SPARK-30674.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR removes any dependencies on pypandoc. It also makes related tweaks to the docs README to clarify the dependency on pandoc (not pypandoc).
### Why are the changes needed?
We are using pypandoc to convert the Spark README from Markdown to ReST for PyPI. PyPI now natively supports Markdown, so we don't need pypandoc anymore. The dependency on pypandoc also sometimes causes issues when installing Python packages that depend on PySpark, as described in #18981.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually:
```sh
python -m venv venv
source venv/bin/activate
pip install -U pip
cd python/
python setup.py sdist
pip install dist/pyspark-3.0.0.dev0.tar.gz
pyspark --version
```
I also built the PySpark and R API docs with `jekyll` and reviewed them locally.
It would be good if a maintainer could also test this by creating a PySpark distribution and uploading it to [Test PyPI](https://test.pypi.org) to confirm the README looks as it should.
Closes#27376 from nchammas/SPARK-30665-pypandoc.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For better JDK11 support, this PR aims to upgrade **Jersey** and **javassist** to `2.30` and `3.35.0-GA` respectively.
### Why are the changes needed?
**Jersey**: This will bring the following `Jersey` updates.
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.30.html
- https://github.com/eclipse-ee4j/jersey/issues/4245 (Java 11 java.desktop module dependency)
**javassist**: This is a transitive dependency from 3.20.0-CR2 to 3.25.0-GA.
- `javassist` officially supports JDK11 from [3.24.0-GA release note](https://github.com/jboss-javassist/javassist/blob/master/Readme.html#L308).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with both JDK8 and JDK11.
Closes#27357 from dongjoon-hyun/SPARK-30639.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update the scalafmt plugin to 1.0.3 and use its new onlyChangedFiles feature rather than --diff
### Why are the changes needed?
Older versions of the plugin either didn't work with scala 2.13, or got rid of the --diff argument and didn't allow for formatting only changed files
### Does this PR introduce any user-facing change?
The /dev/scalafmt script no longer passes through arbitrary args, instead using the arg to select scala version. The issue here is the plugin name literally contains the scala version, and doesn't appear to have a shorter way to refer to it. If srowen or someone else with better maven-fu has an idea I'm all ears.
### How was this patch tested?
Manually, e.g. edited a file and ran
dev/scalafmt
or
dev/scalafmt 2.13
Closes#27279 from koeninger/SPARK-30570.
Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to redesign pandas UDFs as described in [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).
```python
from pyspark.sql.functions import pandas_udf
import pandas as pd
pandas_udf("long")
def plug_one(s: pd.Series) -> pd.Series:
return s + 1
spark.range(10).select(plug_one("id")).show()
```
```
+------------+
|plug_one(id)|
+------------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+------------+
```
Note that, this PR address one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by zero323) together.
In short,
- Adds new way with type hints as an alternative and experimental way.
```python
pandas_udf(schema='...')
def func(c1: Series, c2: Series) -> DataFrame:
pass
```
- Replace and/or add an alias for three types below from UDF, and make them as separate standalone APIs. So, `pandas_udf` is now consistent with regular `udf`s and other expressions.
`df.mapInPandas(udf)` -replace-> `df.mapInPandas(f, schema)`
`df.groupby.apply(udf)` -alias-> `df.groupby.applyInPandas(f, schema)`
`df.groupby.cogroup.apply(udf)` -replace-> `df.groupby.cogroup.applyInPandas(f, schema)`
*`df.groupby.apply` was added from 2.3 while the other were added in the master only.
- No deprecation for the existing ways for now.
```python
pandas_udf(schema='...', functionType=PandasUDFType.SCALAR)
def func(c1, c2):
pass
```
If users are happy with this, I plan to deprecate the existing way and declare using type hints is not experimental anymore.
One design goal in this PR was that, avoid touching the internal (since we didn't deprecate the old ways for now), but supports type hints with a minimised changes only at the interface.
- Once we deprecate or remove the old ways, I think it requires another refactoring for the internal in the future. At the very least, we should rename internal pandas evaluation types.
- If users find this experimental type hints isn't quite helpful, we should simply revert the changes at the interface level.
### Why are the changes needed?
In order to address old design issues. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).
### Does this PR introduce any user-facing change?
For behaviour changes, No.
It adds new ways to use pandas UDFs by using type hints. See below.
**SCALAR**:
```python
pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> Series:
pass # DataFrame represents a struct column
```
**SCALAR_ITER**:
```python
pandas_udf(schema='...')
def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]:
pass # Same as SCALAR but wrapped by Iterator
```
**GROUPED_AGG**:
```python
pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> int:
pass # DataFrame represents a struct column
```
**GROUPED_MAP**:
This was added in Spark 2.3 as of SPARK-20396. As described above, it keeps the existing behaviour. Additionally, we now have a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below:
```python
def func(pdf):
return pdf
df.groupby("...").applyInPandas(func, schema=df.schema)
```
**MAP_ITER**: this is not a pandas UDF anymore
This was added in Spark 3.0 as of SPARK-28198; and this PR replaces the usages. See the example below:
```python
def func(iter):
for df in iter:
yield df
df.mapInPandas(func, df.schema)
```
**COGROUPED_MAP**: this is not a pandas UDF anymore
This was added in Spark 3.0 as of SPARK-27463; and this PR replaces the usages. See the example below:
```python
def asof_join(left, right):
return pd.merge_asof(left, right, on="...", by="...")
df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...")
```
### How was this patch tested?
Unittests added and tested against Python 2.7, 3.6 and 3.7.
Closes#27165 from HyukjinKwon/revisit-pandas.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr intends to upgrade lz4-java from 1.7.0 to 1.7.1.
### Why are the changes needed?
This release includes a bug fix for older macOS. You can see the link below for the changes;
https://github.com/lz4/lz4-java/blob/master/CHANGES.md#171
### Does this PR introduce any user-facing change?
### How was this patch tested?
Existing tests.
Closes#27271 from maropu/SPARK-30486.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update Twitter Chill to 0.9.5.
### Why are the changes needed?
Primarily, Scala 2.13 support for later.
Other changes from 0.9.3 are apparently just minor fixes and improvements:
https://github.com/twitter/chill/releases
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#27227 from srowen/SPARK-29290.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to use `mvn` instead of `sbt` in `dev/scalastyle` to recover GitHub Action.
### Why are the changes needed?
As of now, Apache Spark sbt build is broken by the Maven Central repository policy.
https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> Effective January 15, 2020, The Central Maven Repository no longer supports insecure
> communication over plain HTTP and requires that all requests to the repository are
> encrypted over HTTPS.
We can reproduce this locally by the following.
```
$ rm -rf ~/.m2/repository/org/apache/apache/18/
$ build/sbt clean
```
And, in GitHub Action, `lint-scala` is the only one which is using `sbt`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
First of all, GitHub Action should be recovered.
Also, manually, do the following.
**Without Scalastyle violation**
```
$ dev/scalastyle
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
Using `mvn` from path: /usr/local/bin/mvn
Scalastyle checks passed.
```
**With Scalastyle violation**
```
$ dev/scalastyle
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
Using `mvn` from path: /usr/local/bin/mvn
Scalastyle checks failed at following occurrences:
error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/main/scala/org/apache/spark/SparkConf.scala message=There should be no empty line separating imports in the same group. line=22 column=0
error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/test/scala/org/apache/spark/resource/ResourceProfileSuite.scala message=There should be no empty line separating imports in the same group. line=22 column=0
```
Closes#27242 from dongjoon-hyun/SPARK-30534.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Enable dependency audit files to tell the value of artifact id, version, and classifier of a dependency.
For example, `avro-mapred-1.8.2-hadoop2.jar` should be expanded to `avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar` where `avro-mapred` is the artifact id, `1.8.2` is the version, and `haddop2` is the classifier.
### Why are the changes needed?
Dependency audit files are expected to be consumed by automated tests or downstream tools.
However, current dependency audit files under `dev/deps` only show jar names. And there isn't a simple rule on how to parse the jar name to get the values of different fields. For example, `hadoop2` is the classifier of `avro-mapred-1.8.2-hadoop2.jar`, in contrast, `incubating` is the version of `htrace-core-3.1.0-incubating.jar`.
Reference: There is a good example of the downstream tool that would be enabled as yhuai suggested,
> Say we have a Spark application that depends on a third-party dependency `foo`, which pulls in `jackson` as a transient dependency. Unfortunately, `foo` depends on a different version of `jackson` than Spark. So, in the pom of this Spark application, we use the dependency management section to pin the version of `jackson`. By doing this, we are lifting `jackson` to the top-level dependency of my application and I want to have a way to keep tracking what Spark uses. What we can do is to cross-check my Spark application's classpath with what Spark uses. Then, with a test written in my code base, whenever my application bumps Spark version, this test will check what we define in the application and what Spark has, and then remind us to change our application's pom if needed. In my case, I am fine to directly access git to get these audit files.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Code changes are verified by generated dependency audit files naturally. Thus, there are no tests added.
Closes#27177 from mengCareers/depsOptimize.
Lead-authored-by: Xinrong Meng <meng.careers@gmail.com>
Co-authored-by: mengCareers <meng.careers@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>