[SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/29703.
It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used elsewhere. Arguably `HADOOP_VERSION` is a pretty common name; it shows up here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid unexpected conflicts: `HADOOP_VERSION` is a fairly common variable name, and a value already exported by unrelated tooling could accidentally be picked up by the pip installation.
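
As a minimal sketch of the clash this rename avoids (the generic variable value below is hypothetical, not from the PR):

```python
import os

# Hypothetical: some unrelated tool on the machine already exports a generic
# HADOOP_VERSION for its own configuration.
os.environ["HADOOP_VERSION"] = "3.1"

# Before this rename, PySpark's setup.py keyed its optional distribution
# download on the generic name, so the unrelated value above would have been
# picked up silently during `pip install pyspark`.
unintended = os.environ.get("HADOOP_VERSION")        # "3.1"

# After the rename, only the PySpark-prefixed variable is consulted, so the
# unrelated setting no longer leaks into the installation.
intended = os.environ.get("PYSPARK_HADOOP_VERSION")  # None unless set deliberately

print("generic:", unintended, "| pyspark-specific:", intended)
```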

### Does this PR introduce _any_ user-facing change?

No user-facing change from a released version: it renames the environment variable, but the original `HADOOP_VERSION` option has not been released yet.

### How was this patch tested?

Existing unit tests cover this change.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon 2021-01-05 17:21:32 +09:00
parent 356fdc9a7f
commit 329850c667
3 changed files with 13 additions and 13 deletions

python/docs/source/getting_started/install.rst

@@ -48,11 +48,11 @@ If you want to install extra dependencies for a specific component, you can inst
     pip install pyspark[sql]
-For PySpark with/without a specific Hadoop version, you can install it by using ``HADOOP_VERSION`` environment variables as below:
+For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
 .. code-block:: bash
-    HADOOP_VERSION=2.7 pip install pyspark
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
 The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically
 downloads a different version and use it in PySpark. Downloading it can take a while depending on
@@ -60,15 +60,15 @@ the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manu
 .. code-block:: bash
-    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install
+    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install
 It is recommended to use ``-v`` option in ``pip`` to track the installation and download status.
 .. code-block:: bash
-    HADOOP_VERSION=2.7 pip install pyspark -v
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v
-Supported values in ``HADOOP_VERSION`` are:
+Supported values in ``PYSPARK_HADOOP_VERSION`` are:
 - ``without``: Spark pre-built with user-provided Apache Hadoop
 - ``2.7``: Spark pre-built for Apache Hadoop 2.7
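
A small, hedged illustration of the documented options above, driving pip from Python (the mirror URL and version value are taken from the examples in the hunk; this is just one way to set the variables):

```python
import os
import subprocess
import sys

# Request the Hadoop 2.7 build from a specific mirror while installing PySpark.
env = dict(os.environ)
env["PYSPARK_HADOOP_VERSION"] = "2.7"
env["PYSPARK_RELEASE_MIRROR"] = "http://mirror.apache-kr.org"  # optional mirror override

# "-v" makes pip show the distribution download progress, as recommended above.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "pyspark", "-v"],
    env=env,
)
```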

python/pyspark/find_spark_home.py

@@ -36,7 +36,7 @@ def _find_spark_home():
                  (os.path.isdir(os.path.join(path, "jars")) or
                   os.path.isdir(os.path.join(path, "assembly"))))
-    # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+    # Spark distribution can be downloaded when PYSPARK_HADOOP_VERSION environment variable is set.
     # We should look up this directory first, see also SPARK-32017.
     spark_dist_dir = "spark-distribution"
     paths = [
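
A rough sketch of why that directory is checked first; the path layout and the "jars" check mirror the hunk above, but the surrounding logic is simplified and the names here are illustrative only:

```python
import os

# Simplified sketch: when PYSPARK_HADOOP_VERSION is set at pip-install time,
# setup.py places a full Spark distribution under pyspark/spark-distribution,
# so that directory should be considered before the plain pip-installed layout.
module_home = os.path.dirname(os.path.realpath(__file__))
spark_dist_dir = "spark-distribution"

candidate_paths = [
    os.path.join(module_home, spark_dist_dir),  # downloaded distribution, if any
    module_home,                                # regular pip-installed layout
]

# Pick the first candidate that looks like a Spark home (has a "jars" directory).
spark_home = next(
    (p for p in candidate_paths if os.path.isdir(os.path.join(p, "jars"))),
    None,
)
print("SPARK_HOME candidate:", spark_home)
```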

python/setup.py

@@ -125,16 +125,16 @@ class InstallCommand(install):
             spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
             rmtree(spark_dist, ignore_errors=True)
-            if ("HADOOP_VERSION" in os.environ) or ("HIVE_VERSION" in os.environ):
-                # Note that SPARK_VERSION environment is just a testing purpose.
-                # HIVE_VERSION environment variable is also internal for now in case
+            if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
+                # Note that PYSPARK_VERSION environment is just a testing purpose.
+                # PYSPARK_HIVE_VERSION environment variable is also internal for now in case
                 # we support another version of Hive in the future.
                 spark_version, hadoop_version, hive_version = install_module.checked_versions(
-                    os.environ.get("SPARK_VERSION", VERSION).lower(),
-                    os.environ.get("HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
-                    os.environ.get("HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
+                    os.environ.get("PYSPARK_VERSION", VERSION).lower(),
+                    os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
+                    os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
-                if ("SPARK_VERSION" not in os.environ and
+                if ("PYSPARK_VERSION" not in os.environ and
                     ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                      (hadoop_version, hive_version))):
                     # Do not download and install if they are same as default.
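
For reference, a minimal standalone sketch of the renamed lookup above; `DEFAULT_HADOOP`, `DEFAULT_HIVE`, and `VERSION` here are assumed stand-ins for the real constants in PySpark's install module and version file:

```python
import os

# Assumed stand-ins for install_module.DEFAULT_HADOOP / DEFAULT_HIVE and the
# packaged PySpark VERSION; the real values live in the pyspark package.
DEFAULT_HADOOP = "hadoop3.2"
DEFAULT_HIVE = "hive2.3"
VERSION = "3.1.0"

# The renamed, PySpark-prefixed variables; PYSPARK_VERSION and
# PYSPARK_HIVE_VERSION stay internal/testing-only knobs for now.
spark_version = os.environ.get("PYSPARK_VERSION", VERSION).lower()
hadoop_version = os.environ.get("PYSPARK_HADOOP_VERSION", DEFAULT_HADOOP).lower()
hive_version = os.environ.get("PYSPARK_HIVE_VERSION", DEFAULT_HIVE).lower()

if ("PYSPARK_VERSION" not in os.environ
        and (hadoop_version, hive_version) == (DEFAULT_HADOOP, DEFAULT_HIVE)):
    # Same as the defaults: no separate Spark distribution needs to be downloaded.
    print("Using the default pip-installed distribution.")
else:
    print("Would download Spark %s built for %s / %s"
          % (spark_version, hadoop_version, hive_version))
```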