### What changes were proposed in this pull request?
This PR proposes to redesign the PySpark documentation.
I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.
Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.
In more detail, this PR proposes:
1. Use the [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme, which [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) also use. The CSS overrides are ported from Koalas. The colours in the CSS were chosen by designers for use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.
One disadvantage of this approach is that APIs or classes have to be listed explicitly; however, I think this isn't a big issue in PySpark since we're conservative about adding APIs. I also intentionally listed only classes, not functions, in ML and MLlib to make them easier to manage.
### Why are the changes needed?
Users often complain that the current PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is messy to read compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).
It would be nicer to organise it better, instead of just listing all classes, methods and attributes, so that it is easier to navigate.
Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.
### Does this PR introduce _any_ user-facing change?
Yes, PySpark API documentation will be redesigned.
### How was this patch tested?
Manually tested; the demo site above shows the result.
Closes #29188 from HyukjinKwon/SPARK-32179.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `/usr/bin/env python3` consistently instead of `/usr/bin/env python` in build scripts, to reliably select Python 3.
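As a rough illustration of how such scripts could be located, the sketch below lists files whose first line is the unversioned shebang. The helper name and the demo files are made up for this example and are not part of the PR.

```shell
# Hypothetical helper: list files under a directory whose first line is the
# unversioned shebang "#!/usr/bin/env python".
find_unversioned_shebangs() {
  find "$1" -type f | while IFS= read -r f; do
    if [ "$(head -n 1 "$f")" = "#!/usr/bin/env python" ]; then
      echo "$f"
    fi
  done
}

# Demo on throwaway files (not real Spark scripts).
demo=$(mktemp -d)
printf '#!/usr/bin/env python\nprint("hi")\n' > "$demo/old.py"
printf '#!/usr/bin/env python3\nprint("hi")\n' > "$demo/new.py"
find_unversioned_shebangs "$demo"   # lists only old.py
```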
### Why are the changes needed?
Scripts no longer work with Python 2.
### Does this PR introduce _any_ user-facing change?
No, should be all build system changes.
### How was this patch tested?
Existing tests / NA
Closes #29151 from srowen/SPARK-29909.2.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to drop Python 2.7, 3.4 and 3.5.
Roughly speaking, it removes the widely known Python 2 compatibility workarounds such as `sys.version` comparisons and `__future__` imports. It also removes Python 2-specific code such as `ArrayConstructor` in Spark.
### Why are the changes needed?
1. Drop support for EOL Python versions.
2. Reduce maintenance overhead and remove some legacy code and hacks for Python 2.
3. PyPy2 has a critical bug that causes a flaky test (SPARK-28358), according to my testing and investigation.
4. Users can use Python type hints with Pandas UDFs without thinking about the Python version.
5. Users can rely on a single, latest cloudpickle (https://github.com/apache/spark/pull/28950). With Python 3.8+ it can also leverage C pickle.
### Does this PR introduce _any_ user-facing change?
Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.
### How was this patch tested?
Manually tested and also tested in Jenkins.
Closes #28957 from HyukjinKwon/SPARK-32138.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.
| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |
In the Apache Spark 3.0.0 release, we focused on other features. This PR targets [Apache Spark 3.1.0, scheduled for December 2020](https://spark.apache.org/versioning-policy.html).
### Why are the changes needed?
Apache Hadoop 3.2 has many fixes and new cloud-friendly features.
**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html
### Does this PR introduce _any_ user-facing change?
Since the default Hadoop dependency changes, users will get better support in cloud environments.
### How was this patch tested?
Pass the Jenkins.
Closes #28897 from dongjoon-hyun/SPARK-32058.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to upgrade R version to 4.0.2 in the release docker image. As of SPARK-31918, we should make a release with R 4.0.0+ which works with R 3.5+ too.
### Why are the changes needed?
To unblock releases on CRAN.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested via scripts under `dev/create-release`, manually attaching to the container and checking the R version.
Closes #28922 from HyukjinKwon/SPARK-32089.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is auto-updated by running script `translate-contributors.py`
Closes #28861 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When removing non-existing files in the release script, do not fail.
### Why are the changes needed?
This is to make the release script more robust, as we don't care if the files exist before we remove them.
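One common way to get this behaviour in shell is `rm -f`, which exits successfully when the target is missing. The following is a minimal sketch with a made-up file name; the actual change in the release script may differ.

```shell
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
missing="$workdir/does-not-exist.txt"

# Plain rm fails on a missing file, which would abort a script running
# under `set -e`.
if rm "$missing" 2>/dev/null; then
  echo "rm: unexpected success"
else
  echo "rm: failed because the file is missing"
fi

# rm -f ignores missing files and returns 0.
rm -f "$missing"
rm_f_status=$?
echo "rm -f exit status: $rm_f_status"

rmdir "$workdir"
```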
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tested when cutting 3.0.0 RC
Closes #28815 from cloud-fan/release.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Only push the release tag after the build has finished.
### Why are the changes needed?
If the build fails we don't need a release tag.
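The intended ordering can be sketched as below. `build_release`, `push_tag`, and the tag name are hypothetical stand-ins for the real steps, not the functions used by the release scripts.

```shell
# Hypothetical stand-ins for the real build and tag-push steps.
build_release() { echo "building"; }
push_tag() { echo "pushing tag $1"; }

release() {
  tag=$1
  # Push the tag only if the build succeeded.
  if build_release; then
    push_tag "$tag"
  else
    echo "build failed; not pushing $tag" >&2
    return 1
  fi
}

release v0.0.0-rc0   # example tag name, made up for this sketch
```

With this ordering, a failed build leaves no tag on the remote that would have to be cleaned up afterwards.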
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Running locally with a fake user in https://github.com/apache/spark/pull/28667
Closes #28700 from holdenk/SPARK-31860-build-master-only-push-tags-on-success.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Allow overriding the zinc options in the docker release and set a higher memory setting so the publish step can succeed consistently.
### Why are the changes needed?
The publish step experiences memory pressure.
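An overridable setting of this kind is usually written with a shell default expansion. In the sketch below, the variable names and the `-Xmx` values are assumptions chosen for illustration; they are not taken from the release scripts.

```shell
# Hypothetical variable names and default value, for illustration only.
DEFAULT_ZINC_OPTS="-Xmx4g"

resolve_zinc_opts() {
  # Use the caller's override if it is set and non-empty, else the default.
  printf '%s\n' "${RELEASE_ZINC_OPTS:-$DEFAULT_ZINC_OPTS}"
}

unset RELEASE_ZINC_OPTS
resolve_zinc_opts                                   # prints the default
(RELEASE_ZINC_OPTS="-Xmx8g"; resolve_zinc_opts)     # prints the override
```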
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Running test locally with fake user to see if publish step (besides svn part) succeeds
Closes #28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This PR aims to use the style that is compatible with both python 2 and 3.
### Why are the changes needed?
This will help python 3 migration.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual.
Closes #28632 from williamhyun/use_python3_style.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds the `i` option to `grep -v Download` so that additional `build/mvn` output unrelated to the version string is ignored.
### Why are the changes needed?
SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, in `build/mvn`. This breaks `dev/create-release/release-build.sh`, and the Spark Packaging Jenkins job is currently broken because it hits this issue consistently.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This happens only when the mirror fails, so it was verified by manually hijacking the script. It works as follows.
```
$ echo 'Falling back to archive.apache.org to download Maven' > out
$ build/mvn help:evaluate -Dexpression=project.version >> out
Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
$ cat out | grep -v INFO | grep -v WARNING | grep -v Download
Falling back to archive.apache.org to download Maven
3.1.0-SNAPSHOT
$ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
3.1.0-SNAPSHOT
```
Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR (SPARK-31462) fixes the usage of getopts and case statement in `do-release.sh` and `do-release-docker.sh`.
### Why are the changes needed?
In the current master, do-release.sh contains the following code.
```
while getopts "bn" opt; do
case $opt in
b) GIT_BRANCH=$OPTARG ;;
n) DRY_RUN=1 ;;
?) error "Invalid option: $OPTARG" ;;
esac
done
```
There are three incorrect usages of getopts and the case statement.
1. To set $OPTARG to the argument passed for the option "b", the parameter for getopts should be "b:".
2. To set $OPTARG to the invalid option name passed, the parameter for getopts should start with ":".
3. It's minor but to match the character "?", it's better to escape like "\\?".
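Putting the three points above together, a corrected loop might look like the following sketch. The `parse_args` wrapper and the `error` helper are illustrative stand-ins so the loop can run on its own; they are not the script's actual definitions.

```shell
# Illustrative stand-in for the script's own error helper.
error() {
  echo "$*" >&2
  return 1
}

# Hypothetical wrapper so the loop can be exercised in isolation.
parse_args() {
  OPTIND=1        # restart option parsing on every call
  GIT_BRANCH=""
  DRY_RUN=0
  # The leading ":" selects silent error reporting, so $OPTARG holds the
  # invalid option name. "b:" declares that -b takes an argument.
  while getopts ":b:n" opt; do
    case $opt in
      b) GIT_BRANCH=$OPTARG ;;
      n) DRY_RUN=1 ;;
      \?) error "Invalid option: $OPTARG"; return 1 ;;
    esac
  done
}

parse_args -b branch-3.0 -n
echo "GIT_BRANCH=$GIT_BRANCH DRY_RUN=$DRY_RUN"
```

Without the escape, an unquoted `?` in a case pattern is a glob that matches any single character, which is why the escaped form is the safer choice.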
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
I checked that $GIT_BRANCH is set when do-release.sh is launched with -b option.
I also checked that the error message contains invalid option name when do-release.sh is launched with an invalid option.
Closes #28234 from sarutak/fix-do-release.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The release script stopped working after d5865493ae, as we require `mkdocs 1.0.0`.
This PR upgrades `mkdocs` from 0.1.6.3 to 1.0.4. To do that, Ruby is also upgraded to 2.5.
This PR also fixes some small issues.
### Why are the changes needed?
to make RC
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
tested by 3.0.0-rc1
Closes #28088 from cloud-fan/rc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR makes the following refinements to the workflow for building docs:
* Install Python and Ruby consistently using pyenv and rbenv across both the docs README and the release Dockerfile.
* Pin the Python and Ruby versions we use.
* Pin all direct Python and Ruby dependency versions.
* Eliminate any use of `sudo pip`, which the Python community discourages, or `sudo gem`.
### Why are the changes needed?
This PR should increase the consistency and reproducibility of the doc-building process by managing Python and Ruby in a more consistent way, and by eliminating unused or outdated code.
Here's a possible example of an issue building the docs that would be addressed by the changes in this PR: https://github.com/apache/spark/pull/27459#discussion_r376135719
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual tests:
* I was able to build the Docker image successfully, minus the final part about `RUN useradd`.
* I am unable to run `do-release-docker.sh` because I am not a committer and don't have the required GPG key.
* I built the docs locally and viewed them in the browser.
I think I need a committer to more fully test out these changes.
Closes #27534 from nchammas/SPARK-30731-building-docs.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5.
### Why are the changes needed?
`jekyll-redirect-from` released 0.16.0 a few days ago and that requires Ruby 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
```
$ cd dev/create-release/spark-rm/
$ docker build -t spark:test .
...
ERROR: Error installing jekyll-redirect-from:
jekyll-redirect-from requires Ruby version >= 2.4.0.
...
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually ran the above command to build the `spark-rm` Docker image.
```
...
Successfully installed jekyll-redirect-from-0.15.0
Parsing documentation for jekyll-redirect-from-0.15.0
Installing ri documentation for jekyll-redirect-from-0.15.0
Done installing documentation for jekyll-redirect-from after 0 seconds
1 gem installed
Successfully installed rouge-3.15.0
Parsing documentation for rouge-3.15.0
Installing ri documentation for rouge-3.15.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container e0ec7c77b69f
---> 32dec37291c6
```
Closes #27434 from dongjoon-hyun/SPARK-30704.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR removes any dependencies on pypandoc. It also makes related tweaks to the docs README to clarify the dependency on pandoc (not pypandoc).
### Why are the changes needed?
We are using pypandoc to convert the Spark README from Markdown to ReST for PyPI. PyPI now natively supports Markdown, so we don't need pypandoc anymore. The dependency on pypandoc also sometimes causes issues when installing Python packages that depend on PySpark, as described in #18981.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually:
```sh
python -m venv venv
source venv/bin/activate
pip install -U pip
cd python/
python setup.py sdist
pip install dist/pyspark-3.0.0.dev0.tar.gz
pyspark --version
```
I also built the PySpark and R API docs with `jekyll` and reviewed them locally.
It would be good if a maintainer could also test this by creating a PySpark distribution and uploading it to [Test PyPI](https://test.pypi.org) to confirm the README looks as it should.
Closes #27376 from nchammas/SPARK-30665-pypandoc.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the incorrect PySpark version when releasing preview versions.
### Why are the changes needed?
Failed to make Spark binary distribution:
```
cp: cannot stat 'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: signing failed: No such file or directory
gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
```
```
yumwangubuntu-3513086:~/spark-release/output$ ll spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
total 214140
drwxr-xr-x 2 yumwang stack 4096 Dec 16 06:17 ./
drwxr-xr-x 9 yumwang stack 4096 Dec 16 06:17 ../
-rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
```
```
/usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: Normalizing '3.0.0.dev02' to '3.0.0.dev2'
normalized_version,
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
```
LM-SHC-16502798:spark yumwang$ SPARK_VERSION=3.0.0-preview2
LM-SHC-16502798:spark yumwang$ echo "$SPARK_VERSION" | sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"
3.0.0.dev2
```
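The same substitution can be wrapped in a small function to check the other version formats. The function name is made up for this illustration.

```shell
# Hypothetical helper wrapping the sed expression from the manual test.
to_pyspark_version() {
  echo "$1" | sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"
}

to_pyspark_version 3.0.0-preview2   # 3.0.0.dev2
to_pyspark_version 3.1.0-SNAPSHOT   # 3.1.0.dev0
to_pyspark_version 3.0.0            # 3.0.0
```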
Closes #26909 from wangyum/SPARK-30268.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for not adding commits to the master branch when releasing a preview version.
### Why are the changes needed?
Otherwise, we need to revert this change manually. For example:
![image](https://user-images.githubusercontent.com/5399861/70788945-f9d15180-1dcc-11ea-81f5-c0d89c28440a.png)
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test
Closes #26879 from wangyum/SPARK-30253.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
- Reverts commit 1f94bf4 and d6be46e
- Switches python to python3 in Docker release image.
### Why are the changes needed?
`dev/make-distribution.sh` and `python/setup.py` use python3.
https://github.com/apache/spark/pull/26844/files#diff-ba2c046d92a1d2b5b417788bfb5cb5f8L236
https://github.com/apache/spark/pull/26330/files#diff-8cf6167d58ce775a08acafcfe6f40966
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
```
yumwangubuntu-3513086:~/spark$ dev/create-release/do-release-docker.sh -n -d /home/yumwang/spark-release
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]: 3.0.0-preview2
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [master]:
ASF user [yumwang]:
Full name [Yuming Wang]:
GPG key [yumwangapache.org]: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
================
Release details:
BRANCH: master
VERSION: 3.0.0-preview2
TAG: v3.0.0-preview2-rc1
NEXT: 3.0.1-SNAPSHOT
ASF USER: yumwang
GPG KEY: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
FULL NAME: Yuming Wang
E-MAIL: yumwangapache.org
================
Is this info correct [y/n]? y
GPG passphrase:
========================
= Building spark-rm image with tag latest...
Command: docker build -t spark-rm:latest --build-arg UID=110302528 /home/yumwang/spark/dev/create-release/spark-rm
Log file: docker-build.log
Building v3.0.0-preview2-rc1; output will be at /home/yumwang/spark-release/output
gpg: directory '/home/spark-rm/.gnupg' created
gpg: keybox '/home/spark-rm/.gnupg/pubring.kbx' created
gpg: /home/spark-rm/.gnupg/trustdb.gpg: trustdb created
gpg: key 6E1B4122F6A3A338: public key "Yuming Wang <yumwangapache.org>" imported
gpg: key 6E1B4122F6A3A338: secret key imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: secret keys read: 1
gpg: secret keys imported: 1
========================
= Creating release tag v3.0.0-preview2-rc1...
Command: /opt/spark-rm/release-tag.sh
Log file: tag.log
It may take some time for the tag to be synchronized to github.
Press enter when you've verified that the new tag (v3.0.0-preview2-rc1) is available.
========================
= Building Spark...
Command: /opt/spark-rm/release-build.sh package
Log file: build.log
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
========================
= Publishing release
Command: /opt/spark-rm/release-build.sh publish-release
Log file: publish.log
```
Generated doc:
![image](https://user-images.githubusercontent.com/5399861/70693075-a7723100-1cf7-11ea-9f88-9356a02349a1.png)
Closes #26848 from wangyum/SPARK-30216.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add another pre-built binary distribution with `-Phadoop-2.7 -Phive-1.2` at `Apache Spark 3.0.0`.
**PRE-BUILT BINARY DISTRIBUTION**
```
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.asc
spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.sha512
```
**CONTENTS (snippet)**
```
$ ls *hadoop-*
hadoop-annotations-2.7.4.jar hadoop-mapreduce-client-shuffle-2.7.4.jar
hadoop-auth-2.7.4.jar hadoop-yarn-api-2.7.4.jar
hadoop-client-2.7.4.jar hadoop-yarn-client-2.7.4.jar
hadoop-common-2.7.4.jar hadoop-yarn-common-2.7.4.jar
hadoop-hdfs-2.7.4.jar hadoop-yarn-server-common-2.7.4.jar
hadoop-mapreduce-client-app-2.7.4.jar hadoop-yarn-server-web-proxy-2.7.4.jar
hadoop-mapreduce-client-common-2.7.4.jar parquet-hadoop-1.10.1.jar
hadoop-mapreduce-client-core-2.7.4.jar parquet-hadoop-bundle-1.6.0.jar
hadoop-mapreduce-client-jobclient-2.7.4.jar
$ ls *hive-*
hive-beeline-1.2.1.spark2.jar hive-jdbc-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar hive-metastore-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar spark-hive-thriftserver_2.12-3.0.0-SNAPSHOT.jar
```
### Why are the changes needed?
Since Apache Spark switched to use `-Phive-2.3` by default, all pre-built binary distributions will use `-Phive-2.3`. This PR adds a `hadoop-2.7/hive-1.2` distribution to provide a combination similar to the `Apache Spark 2.4` line.
### Does this PR introduce any user-facing change?
Yes. This is an additional distribution which resembles the `Apache Spark 2.4` line in terms of `hive` version.
### How was this patch tested?
Manual.
Please note that we need a dry-run mode, but the as-is release script does not generate additional combinations, including this one, in `dry-run` mode.
Closes #26688 from dongjoon-hyun/SPARK-29989.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR aims to fix a typo: `rogue` -> `rouge`.
This is a follow-up of https://github.com/apache/spark/pull/26521.
### Why are the changes needed?
To support `Python 3`, we upgraded from `pygments` to `rouge`.
### Does this PR introduce any user-facing change?
No. (This is for only document generation.)
### How was this patch tested?
Manually.
```
$ docker build -t test dev/create-release/spark-rm/
...
1 gem installed
Successfully installed rouge-3.13.0
Parsing documentation for rouge-3.13.0
Installing ri documentation for rouge-3.13.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container 9bd8707d9e84
---> a18b2f6b0bb9
...
```
Closes #26686 from dongjoon-hyun/SPARK-28752.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add `-Phive-2.3` to publish profiles.
Since Apache Spark 3.0.0, Maven artifacts will be published with the Apache Hive 2.3 profile only.
This PR will also recover the `SNAPSHOT` publishing Jenkins job.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
We will provide the pre-built distributions (with Hive 1.2.1 also) like Apache Spark 2.4.
SPARK-29989 will update the release script to generate all combinations.
### Why are the changes needed?
This will reduce the explicit dependency on the illegitimate Hive fork in Maven repository.
### Does this PR introduce any user-facing change?
Yes, but this is a dev-only change.
### How was this patch tested?
Manual.
Closes #26648 from dongjoon-hyun/SPARK-30007.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to switch from `pygments.rb`, which only supports Python 2 and seems to have been inactive for the last few years (https://github.com/tmm1/pygments.rb), to Rouge, a pure-Ruby code highlighter that is compatible with Pygments.
I thought it would be pretty difficult to change, but thankfully Rouge does a great job as the alternative.
### Why are the changes needed?
We're moving to Python 3 and dropping Python 2 completely.
### Does this PR introduce any user-facing change?
The syntax highlighting style may differ slightly, but there should be no notable change.
### How was this patch tested?
Manually tested the build and checked the documentation.
Closes #26521 from HyukjinKwon/SPARK-28752.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`release-build.sh` fails to publish a release in dry-run mode with the following error message:
```
/opt/spark-rm/release-build.sh: line 429: pushd: spark-repo-g4MBm/org/apache/spark: No such file or directory
```
We need to at least run the `mvn clean install` command once to create the `$tmp_repo` path, but now those steps are all skipped under dry-run mode. This PR fixes the issue.
### How was this patch tested?
Tested locally.
Closes #26329 from jiangxb1987/dryrun.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update `release-build.sh`, to allow pyspark version name format `${versionNumber}-preview`, otherwise the release script won't generate pyspark release tarballs.
### How was this patch tested?
Tested locally.
Closes #26306 from jiangxb1987/buildPython.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add `hadoop-3.2` profile to pre-built binary package releases.
### Why are the changes needed?
Since Apache Spark 3.0.0, we provide a Hadoop 3.2 pre-built binary.
### Does this PR introduce any user-facing change?
No. (Although the artifacts are available, this change is for release managers).
### How was this patch tested?
Manual. Please note that `DRY_RUN=0` disables these combinations.
```
$ dev/create-release/release-build.sh package
...
Packages to build: without-hadoop hadoop3.2 hadoop2.7
make_binary_release without-hadoop -Pscala-2.12 -Phadoop-provided 2.12
make_binary_release hadoop3.2 -Pscala-2.12 -Phadoop-3.2 -Phive -Phive-thriftserver 2.12
make_binary_release hadoop2.7 -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver withpip,withr 2.12
```
Closes #26260 from dongjoon-hyun/SPARK-29608.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR allows `bin/spark-submit --version` to show the correct information while the previous versions, which were created by `dev/create-release/do-release-docker.sh`, show incorrect information.
There are two root causes of the incorrect information:
1. Did not pass `USER` environment variable to the docker container
1. Did not keep `.git` directory in the work directory
### Why are the changes needed?
The information is missing while the previous versions show the correct information.
### Does this PR introduce any user-facing change?
Yes, the following is the console output in branch-2.3
```
$ bin/spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.4
/_/
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212
Branch HEAD
Compiled by user ishizaki on 2019-09-02T02:18:10Z
Revision 8c6f8150f3
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
```
Without this PR, the console output is as follows
```
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.4
/_/
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212
Branch
Compiled by user on 2019-08-26T08:29:39Z
Revision
Url
Type --help for more information.
```
### How was this patch tested?
After building the package, I manually executed `bin/spark-submit --version`
Closes #25655 from kiszk/SPARK-28906.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a release announcement template.
### Why are the changes needed?
- We want to use a formal template, including HTTPS links, in future releases.
- Future release managers won't need to search the mailing list to find this form.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A.
Closes #25656 from dongjoon-hyun/SPARK-28951.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to specify the Jekyll version explicitly in our release docker image.
### Why are the changes needed?
Jekyll 4.0 was released recently and dropped Ruby 2.3 support.
This breaks our release docker image build.
```
Building native extensions. This could take a while...
ERROR: Error installing jekyll:
jekyll-sass-converter requires Ruby version >= 2.4.0.
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
The following should succeed.
```
$ docker build -t spark-rm:test --build-arg UID=501 dev/create-release/spark-rm
...
Successfully tagged spark-rm:test
```
Closes #25578 from dongjoon-hyun/SPARK-28868.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The CRAN repo changed its key, which causes our release script to fail. This is a release blocker for Apache Spark 2.4.4 and 3.0.0.
- https://cran.r-project.org/bin/linux/ubuntu/README.html
```
Err:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9
...
W: GPG error: https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9
E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease' is not signed.
```
Note that they are reusing `cran35` for R 3.6 although they changed the key.
```
Even though R has moved to version 3.6, for compatibility the sources.list entry still uses the cran3.5 designation.
```
This PR aims to recover the docker image generation first. We will verify the R doc generation in a separate JIRA and PR.
## How was this patch tested?
Manual. After `docker-build.log`, it should continue to the next stage, `Building v3.0.0-rc1`.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n -s docs
...
Log file: docker-build.log
Building v3.0.0-rc1; output will be at /tmp/spark-3.0.0/output
```
Closes #25339 from dongjoon-hyun/SPARK-28606.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
Unlike the other versions, `x.x.0-SNAPSHOT` causes `x.x.-1`. Although this will not happen in the tags (there is no `SNAPSHOT` postfix), we had better fix this.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.-1]:
```
Since we already have `RELEASE_VERSION`, which is obtained by removing `SNAPSHOT`, this PR uses `RELEASE_VERSION` instead of `VERSION`.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]:
```
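The difference between the two variables can be sketched as follows. This assumes the script takes the third dot-separated field as the patch revision, which is an assumption inferred from the `3.0.-1` prompt; the actual extraction may differ, and `patch_of` is a made-up helper.

```shell
# Hypothetical helper: third dot-separated field of a version string.
patch_of() {
  echo "$1" | cut -d . -f 3
}

VERSION="3.0.0-SNAPSHOT"
RELEASE_VERSION="${VERSION%-SNAPSHOT}"   # strip the suffix

patch_of "$VERSION"           # 0-SNAPSHOT, which is not the literal "0"
patch_of "$RELEASE_VERSION"   # 0
```

Under that assumption, any check of the form "is the patch revision 0?" only behaves as intended when it is fed `RELEASE_VERSION`.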
## How was this patch tested?
Manually do `dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n` and see the default value of `Release`.
Closes #25254 from dongjoon-hyun/SPARK-28511.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
There are two cases when we use `dry run`.
First, when the tag already exists, we can ask for `confirmation` of the existing tag name.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-2.4.4 -n -s docs
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]:
Current branch version is 2.4.4-SNAPSHOT.
Release [2.4.4]: 2.4.3
RC # [1]:
v2.4.3-rc1 already exists. Continue anyway [y/n]? y
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.3-rc1]:
```
Second, when the tag doesn't exist, it is better to ask for confirmation of the branch name. If the user does not change the default value shown below, the build will eventually fail.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-2.4.4 -n -s docs
Branch [branch-2.4]:
Current branch version is 2.4.4-SNAPSHOT.
Release [2.4.4]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.4-rc1]:
```
This PR improves the second case by offering the branch name as the default instead. This helps with release testing before tagging.
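The choice of default can be summarized with a small sketch. The helper name `default_dry_run_ref` and its `yes`/`no` flag are hypothetical and only illustrate the two cases described above; how the script actually detects an existing tag is not shown here.
```bash
# Hypothetical helper (not taken from the Spark release scripts).
# $1: candidate tag name, $2: branch name, $3: "yes" if the tag exists.
default_dry_run_ref() {
  tag="$1"
  branch="$2"
  tag_exists="$3"
  if [ "$tag_exists" = "yes" ]; then
    # Case 1: the tag exists, so offer it as the default ref.
    printf '%s\n' "$tag"
  else
    # Case 2: no tag yet, so fall back to the branch name.
    printf '%s\n' "$branch"
  fi
}

default_dry_run_ref "v2.4.3-rc1" "branch-2.4" yes   # v2.4.3-rc1
default_dry_run_ref "v2.4.4-rc1" "branch-2.4" no    # branch-2.4
```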
## How was this patch tested?
Manually run the following and check the default value of the `Ref` field.
```
$ dev/create-release/do-release-docker.sh -d /tmp/spark-2.4.4 -n -s docs
Branch [branch-2.4]:
Current branch version is 2.4.4-SNAPSHOT.
Release [2.4.4]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [branch-2.4]:
...
```
Closes #25240 from dongjoon-hyun/SPARK-28496.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Use an HTTPS URL for the CRAN repo (and for a Scala download in a Dockerfile).
## How was this patch tested?
Existing tests.
Closes #24664 from srowen/SPARK-27794.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Tighten up some key links to the project and download pages to use HTTPS.
## How was this patch tested?
N/A
Closes #24665 from srowen/HTTPSURLs.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below.
## How was this patch tested?
Existing tests.
Closes #23098 from srowen/SPARK-26132.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Upgrade the Docker image for the release build to Ubuntu 18.04 LTS.
## How was this patch tested?
Manually tested.
Closes #23932 from dbtsai/ubuntu18.04.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error.
I work for Semmle, and noticed the bug with our LGTM code analyzer:
0655f1624f/files/dev/create-release/releaseutils.py?sort=name&dir=ASC&mode=heatmap#x1434915b6576fb40:1
## How was this patch tested?
I checked that `./dev/run-tests` passes.
Closes #23567 from ipwright/wrong-number-of-arguments-fix.
Authored-by: wright <wright@semmle.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR checks the Spark release version in `dev/create-release/release-build.sh` so that the steps that remove binary license/notice files from a source release are skipped for branch-2.3 (these files only exist in master/branch-2.4 now).
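A version gate of this kind might look like the sketch below. It is not the code in `release-build.sh`; the function name `has_binary_license_files` is hypothetical, and the 2.4 threshold is taken from the note above that the files only exist in master and branch-2.4.
```bash
# Hypothetical helper (not taken from release-build.sh).
# Succeeds when the given version is 2.4 or newer, i.e. when the
# binary license/notice files are expected to exist.
has_binary_license_files() {
  major=$(printf '%s\n' "$1" | cut -d . -f 1)
  minor=$(printf '%s\n' "$1" | cut -d . -f 2)
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 4 ]; }
}

if has_binary_license_files "2.3.3"; then
  echo "remove binary license/notice files"
else
  echo "skip removal"   # this branch runs for 2.3.3
fi
```
Comparing the major and minor components numerically avoids relying on string ordering of version numbers.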
## How was this patch tested?
Manually checked.
Closes #23538 from maropu/FixReleaseScript.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR uses the GitHub repository instead of GitBox because the GitHub repository returns the HTTP status header correctly.
## How was this patch tested?
Manual.
```
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```
Closes #23482 from dongjoon-hyun/SPARK-26554-2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
This PR fixes the old script name in `release-tag.sh`.
$ ./release-tag.sh --help | head -n1
usage: tag-release.sh
## How was this patch tested?
Manual.
$ ./release-tag.sh --help | head -n1
usage: release-tag.sh
Closes #23477 from dongjoon-hyun/SPARK-RELEASE-TAG.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Unlike the previous Apache Git repository, the new GitBox repository returns a fake `HTTP 200` header instead of a `404 Not Found` header. This breaks the release scripts. This PR fixes them to check the HTML body message instead of the fake HTTP headers. This is a release blocker.
```bash
$ curl -s --head --fail "https://gitbox.apache.org/repos/asf?p=spark.git;a=commit;h=v3.0.0"
HTTP/1.1 200 OK
Date: Sun, 06 Jan 2019 22:42:39 GMT
Server: Apache/2.4.18 (Ubuntu)
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Allow-Headers: X-PINGOTHER
Access-Control-Max-Age: 1728000
Content-Type: text/html; charset=utf-8
```
**BEFORE**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
v2.4.1-rc1 already exists. Continue anyway [y/n]?
```
**AFTER**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```
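A body-based check of the kind described in this PR could be sketched as follows. The helper name `page_shows_existing_ref` is hypothetical, and the marker text `404 Not Found` in the HTML body is an assumption made for illustration; the actual message returned by GitBox is not shown in this description.
```bash
# Hypothetical helper (not taken from the Spark release scripts).
# Reads an HTML page on stdin and succeeds only if the page does NOT
# contain the "not found" marker.
# ASSUMPTION: the marker text below is illustrative.
page_shows_existing_ref() {
  ! grep -q '404 Not Found'
}

# Possible usage against a live server (REPO_WEBUI_URL is hypothetical):
# curl -s "$REPO_WEBUI_URL" | page_shows_existing_ref && echo "ref exists"
```
Because the function inspects the body rather than the status line, it still works when the server answers every request with `200 OK`.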
## How was this patch tested?
Manual.
Closes #23476 from dongjoon-hyun/SPARK-26554.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Because Apache recently moved from git-wip-us.apache.org to gitbox.apache.org, we need to update the packaging scripts to point to the new repo location.
This will also need to be backported to 2.4, 2.3, 2.1, 2.0 and 1.6.
## How was this patch tested?
The build system will test this.
Closes #23454 from shaneknapp/update-apache-repo.
Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Update `known_translations` after running `translate-contributors.py` during the 2.4.0 release.
## How was this patch tested?
N/A
Closes #22949 from cloud-fan/contributors.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>