spark-instrumented-optimizer/dev
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When these environment variables are set, the installation internally downloads the corresponding Spark distribution and then points the Spark home to it. This PR also exposes the download mirror as an environment variable, `PYSPARK_RELEASE_MIRROR`.
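The mirror can be combined with the version selection; for example, using the Apache archive as an illustrative mirror host:

```bash
PYSPARK_RELEASE_MIRROR=https://archive.apache.org/dist HADOOP_VERSION=2.7 pip install pyspark
```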

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and a bug in pip itself. Once pip fixes this issue, we can switch from the environment variables to proper installation options; see SPARK-32837.

    It is possible to work around this, but only in a very ugly or hacky way that requires a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can stop packing the relevant jars together (keeping only the relevant Python scripts) and instead download the Spark distribution as this PR proposes; see the sketch after this list.

- This approach is roughly consistent with SparkR:

  SparkR provides a method, `SparkR::install.spark`, to support CRAN installation. This works because SparkR is provided purely as an R library; for example, the `sparkr` script is not packed with it.

  PySpark cannot take this approach because PySpark packaging ships the relevant executable scripts together, e.g. the `pyspark` shell.

  If PySpark provided a method such as `pyspark.install_spark`, users could not call it from within `pyspark`, because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.

- There appears to be no way to publish releases with different Hadoop or Hive versions to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so that is not an option.

  Given my investigation, the usual approaches are either the `--install-option` hack above or environment variables.
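For reference, the download-at-install-time behaviour described above is roughly equivalent to the following manual steps. This is only an illustrative sketch: the mirror URL, file layout, and variable handling follow the standard Apache distribution conventions and are not the exact logic in `setup.py`.

```bash
# Illustrative sketch of what the env-var-driven installation does conceptually
# (not the exact setup.py logic).
SPARK_VERSION=3.0.1
HADOOP_VERSION=3.2
MIRROR="${PYSPARK_RELEASE_MIRROR:-https://archive.apache.org/dist}"

# Download the matching Spark distribution from the mirror ...
curl -LO "$MIRROR/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz"

# ... extract it, and point the Spark home at the result.
tar -xzf "spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz"
export SPARK_HOME="$PWD/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION"
```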

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select the Hive and Hadoop versions as below when they install PySpark from `pip`:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested on Mac and Windows, after building Spark and the `python/dist/pyspark-3.1.0.dev0.tar.gz` package:

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```
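The `pyspark-3.1.0.dev0.tar.gz` source distribution referred to below can be produced from the `python` directory, for example:

```bash
cd python
python setup.py sdist   # writes python/dist/pyspark-3.1.0.dev0.tar.gz
```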

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```
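
As an additional manual check after installation, the resolved version and Spark home can be printed, for example (`find_spark_home` is PySpark's internal helper for locating the Spark home, so this is just a convenience check):

```bash
python -c "import pyspark; print(pyspark.__version__)"
python -c "from pyspark.find_spark_home import _find_spark_home; print(_find_spark_home())"
```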

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
| Name | Last commit message | Date |
|------|---------------------|------|
| create-release | [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI | 2020-09-23 09:30:51 +09:00 |
| deps | [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1 | 2020-09-10 14:16:19 +09:00 |
| sparktestsupport | [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI | 2020-09-23 09:30:51 +09:00 |
| tests | [MINOR] Fix typos in dev/* scripts. | 2018-01-31 07:37:25 +09:00 |
| .gitignore | [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file. | 2018-01-31 00:51:00 +09:00 |
| .rat-excludes | [SPARK-23431][CORE] Expose stage level peak executor metrics via REST API | 2020-08-04 21:11:00 +08:00 |
| .scalafmt.conf | [SPARK-26177] Config change followup to [] Automated formatting for Scala code | 2018-12-03 10:03:51 -06:00 |
| appveyor-guide.md | [SPARK-26918][DOCS] All .md should have ASF license header | 2019-03-30 19:49:45 -05:00 |
| appveyor-install-dependencies.ps1 | [SPARK-32231][R][INFRA] Use Hadoop 3.2 winutils in AppVeyor build | 2020-07-09 17:18:39 +09:00 |
| change-scala-version.sh | [SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13 | 2019-12-03 08:59:43 -08:00 |
| check-license | [MINOR][BUILD] Upgrade apache-rat to 0.13 | 2019-04-01 16:44:42 +09:00 |
| checkstyle-suppressions.xml | [SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+ | 2019-11-03 15:13:06 -08:00 |
| checkstyle.xml | [MINOR] Fix google style guide address | 2019-12-12 11:04:01 -06:00 |
| github_jira_sync.py | [MINOR] Fix usage print to guide pip3 to install jira-python library | 2020-09-03 01:10:59 +09:00 |
| lint-java | [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses) | 2018-01-13 21:34:28 -08:00 |
| lint-python | [SPARK-32204][SPARK-32182][DOCS][FOLLOW-UP] Use IPython instead of ipython to check if installed in dev/lint-python | 2020-09-09 12:22:13 +08:00 |
| lint-r | [SPARK-29932][R][TESTS] lint-r should do non-zero exit in case of errors | 2019-11-17 10:09:46 -08:00 |
| lint-r.R | [MINOR][R] small tidying of sh scripts for R | 2020-04-30 16:58:05 -07:00 |
| lint-scala | [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles | 2019-03-15 08:20:42 +09:00 |
| make-distribution.sh | [SPARK-31041][BUILD] Show Maven errors from within make-distribution.sh | 2020-03-11 08:22:02 -05:00 |
| merge_spark_pr.py | [MINOR] Fix usage print to guide pip3 to install jira-python library | 2020-09-03 01:10:59 +09:00 |
| mima | [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles | 2019-03-15 08:20:42 +09:00 |
| pip-sanity-check.py | [SPARK-32319][PYSPARK] Disallow the use of unused imports | 2020-08-08 08:51:57 -07:00 |
| README.md | Merge pull request #565 from pwendell/dev-scripts. Closes #565. | 2014-02-08 23:13:34 -08:00 |
| requirements.txt | [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation | 2020-08-26 12:23:24 +09:00 |
| run-pip-tests | [SPARK-32419][PYTHON][BUILD] Avoid using subshell for Conda env (de)activation in pip packaging test | 2020-07-25 13:09:23 +09:00 |
| run-tests | [SPARK-29672][PYSPARK] update spark testing framework to use python3 | 2019-11-14 10:18:55 -08:00 |
| run-tests-jenkins | [SPARK-29672][PYSPARK] update spark testing framework to use python3 | 2019-11-14 10:18:55 -08:00 |
| run-tests-jenkins.py | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| run-tests.py | [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test triggers | 2020-08-21 21:23:41 +09:00 |
| sbt-checkstyle | [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles | 2019-03-15 08:20:42 +09:00 |
| scalafmt | [SPARK-30570][BUILD] Update scalafmt plugin to 1.0.3 with onlyChangedFiles feature | 2020-01-23 12:44:43 -08:00 |
| scalastyle | Revert "[SPARK-30534][INFRA] Use mvn in dev/scalastyle" | 2020-01-21 18:23:03 +09:00 |
| test-dependencies.sh | [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES | 2020-07-17 11:59:19 -05:00 |
| tox.ini | [SPARK-32719][PYTHON] Add Flake8 check missing imports | 2020-08-31 11:23:31 +09:00 |

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.