spark-instrumented-optimizer/dev/create-release
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts). And downloads the Spark distribution as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as a R library. For example, `sparkr` script is not packed together.

  PySpark cannot take this approach because PySpark packaging ships relevant executable script together, e.g.) `pyspark` shell.

  If PySpark has a method such as `pyspark.install_spark`, users cannot call it in `pyspark` because `pyspark` already assumes relevant Spark is installed, JVM is launched, etc.

- There looks no way to release that contains different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  The usual way looks either `--install-option` above with hacks or environment variables given my investigation.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested in Mac and Windows (after building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
..
spark-rm [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation 2020-08-26 12:23:24 +09:00
announce.tmpl [SPARK-28951][INFRA] Add release announce template 2019-09-02 14:55:05 -07:00
do-release-docker.sh [SPARK-31889][BUILD] Docker release script does not allocate enough memory to reliably publish 2020-06-01 15:49:17 -07:00
do-release.sh Revert "[SPARK-31860][BUILD] only push release tags on succes" 2020-06-12 17:50:43 +08:00
generate-contributors.py [SPARK-32719][PYTHON] Add Flake8 check missing imports 2020-08-31 11:23:31 +09:00
known_translations [MINOR] update dev/create-release/known_translations 2020-06-29 11:53:56 +00:00
release-build.sh [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI 2020-09-23 09:30:51 +09:00
release-tag.sh [SPARK-32556][INFRA] Fix release script to have urlencoded passwords where required 2020-08-07 03:31:45 -05:00
release-util.sh [SPARK-32556][INFRA] Fix release script to have urlencoded passwords where required 2020-08-07 03:31:45 -05:00
releaseutils.py [MINOR] Fix usage print to guide pip3 to install jira-python library 2020-09-03 01:10:59 +09:00
translate-contributors.py [SPARK-32719][PYTHON] Add Flake8 check missing imports 2020-08-31 11:23:31 +09:00
vote.tmpl [MINOR][DOCS] Tighten up some key links to the project and download pages to use HTTPS 2019-05-21 10:56:42 -07:00