[SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/29703.
It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used elsewhere. Arguably `HADOOP_VERSION` is a pretty common name; it shows up here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid unexpected conflicts: `HADOOP_VERSION` is a fairly common variable name, and a value already exported by unrelated tooling could accidentally be picked up by the pip installation.
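
As a minimal sketch of the clash this rename avoids (the generic variable value below is hypothetical, not from the PR):

```python
import os

# Hypothetical: some unrelated tool on the machine already exports a generic
# HADOOP_VERSION for its own configuration.
os.environ["HADOOP_VERSION"] = "3.1"

# Before this rename, PySpark's setup.py keyed its optional distribution
# download on the generic name, so the unrelated value above would have been
# picked up silently during `pip install pyspark`.
unintended = os.environ.get("HADOOP_VERSION")        # "3.1"

# After the rename, only the PySpark-prefixed variable is consulted, so the
# unrelated setting no longer leaks into the installation.
intended = os.environ.get("PYSPARK_HADOOP_VERSION")  # None unless set deliberately

print("generic:", unintended, "| pyspark-specific:", intended)
```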

### Does this PR introduce _any_ user-facing change?

No user-facing change from a released version: it renames the environment variable, but the original `HADOOP_VERSION` option has not been released yet.

### How was this patch tested?

Existing unit tests cover this change.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon 2021-01-05 17:21:32 +09:00
parent 356fdc9a7f
commit 329850c667
3 changed files with 13 additions and 13 deletions

python/docs/source/getting_started/install.rst

@@ -48,11 +48,11 @@ If you want to install extra dependencies for a specific component, you can inst
     pip install pyspark[sql]
-For PySpark with/without a specific Hadoop version, you can install it by using ``HADOOP_VERSION`` environment variables as below:
+For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
 .. code-block:: bash
-    HADOOP_VERSION=2.7 pip install pyspark
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
 The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically
 downloads a different version and use it in PySpark. Downloading it can take a while depending on
@@ -60,15 +60,15 @@ the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manu
 .. code-block:: bash
-    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install
+    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install
 It is recommended to use ``-v`` option in ``pip`` to track the installation and download status.
 .. code-block:: bash
-    HADOOP_VERSION=2.7 pip install pyspark -v
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v
-Supported values in ``HADOOP_VERSION`` are:
+Supported values in ``PYSPARK_HADOOP_VERSION`` are:
 - ``without``: Spark pre-built with user-provided Apache Hadoop
 - ``2.7``: Spark pre-built for Apache Hadoop 2.7
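
A small, hedged illustration of the documented options above, driving pip from Python (the mirror URL and version value are taken from the examples in the hunk; this is just one way to set the variables):

```python
import os
import subprocess
import sys

# Request the Hadoop 2.7 build from a specific mirror while installing PySpark.
env = dict(os.environ)
env["PYSPARK_HADOOP_VERSION"] = "2.7"
env["PYSPARK_RELEASE_MIRROR"] = "http://mirror.apache-kr.org"  # optional mirror override

# "-v" makes pip show the distribution download progress, as recommended above.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "pyspark", "-v"],
    env=env,
)
```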

python/pyspark/find_spark_home.py

@@ -36,7 +36,7 @@ def _find_spark_home():
                  (os.path.isdir(os.path.join(path, "jars")) or
                   os.path.isdir(os.path.join(path, "assembly"))))
-    # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+    # Spark distribution can be downloaded when PYSPARK_HADOOP_VERSION environment variable is set.
     # We should look up this directory first, see also SPARK-32017.
     spark_dist_dir = "spark-distribution"
     paths = [
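
A rough sketch of why that directory is checked first; the path layout and the "jars" check mirror the hunk above, but the surrounding logic is simplified and the names here are illustrative only:

```python
import os

# Simplified sketch: when PYSPARK_HADOOP_VERSION is set at pip-install time,
# setup.py places a full Spark distribution under pyspark/spark-distribution,
# so that directory should be considered before the plain pip-installed layout.
module_home = os.path.dirname(os.path.realpath(__file__))
spark_dist_dir = "spark-distribution"

candidate_paths = [
    os.path.join(module_home, spark_dist_dir),  # downloaded distribution, if any
    module_home,                                # regular pip-installed layout
]

# Pick the first candidate that looks like a Spark home (has a "jars" directory).
spark_home = next(
    (p for p in candidate_paths if os.path.isdir(os.path.join(p, "jars"))),
    None,
)
print("SPARK_HOME candidate:", spark_home)
```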

python/setup.py

@@ -125,16 +125,16 @@ class InstallCommand(install):
             spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
             rmtree(spark_dist, ignore_errors=True)
-            if ("HADOOP_VERSION" in os.environ) or ("HIVE_VERSION" in os.environ):
-                # Note that SPARK_VERSION environment is just a testing purpose.
-                # HIVE_VERSION environment variable is also internal for now in case
+            if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
+                # Note that PYSPARK_VERSION environment is just a testing purpose.
+                # PYSPARK_HIVE_VERSION environment variable is also internal for now in case
                 # we support another version of Hive in the future.
                 spark_version, hadoop_version, hive_version = install_module.checked_versions(
-                    os.environ.get("SPARK_VERSION", VERSION).lower(),
-                    os.environ.get("HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
-                    os.environ.get("HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
+                    os.environ.get("PYSPARK_VERSION", VERSION).lower(),
+                    os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
+                    os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
-                if ("SPARK_VERSION" not in os.environ and
+                if ("PYSPARK_VERSION" not in os.environ and
                     ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                      (hadoop_version, hive_version))):
                     # Do not download and install if they are same as default.
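
For reference, a minimal standalone sketch of the renamed lookup above; `DEFAULT_HADOOP`, `DEFAULT_HIVE`, and `VERSION` here are assumed stand-ins for the real constants in PySpark's install module and version file:

```python
import os

# Assumed stand-ins for install_module.DEFAULT_HADOOP / DEFAULT_HIVE and the
# packaged PySpark VERSION; the real values live in the pyspark package.
DEFAULT_HADOOP = "hadoop3.2"
DEFAULT_HIVE = "hive2.3"
VERSION = "3.1.0"

# The renamed, PySpark-prefixed variables; PYSPARK_VERSION and
# PYSPARK_HIVE_VERSION stay internal/testing-only knobs for now.
spark_version = os.environ.get("PYSPARK_VERSION", VERSION).lower()
hadoop_version = os.environ.get("PYSPARK_HADOOP_VERSION", DEFAULT_HADOOP).lower()
hive_version = os.environ.get("PYSPARK_HIVE_VERSION", DEFAULT_HIVE).lower()

if ("PYSPARK_VERSION" not in os.environ
        and (hadoop_version, hive_version) == (DEFAULT_HADOOP, DEFAULT_HIVE)):
    # Same as the defaults: no separate Spark distribution needs to be downloaded.
    print("Using the default pip-installed distribution.")
else:
    print("Would download Spark %s built for %s / %s"
          % (spark_version, hadoop_version, hive_version))
```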