[SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/29703. It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION`, in case `HADOOP_VERSION` is already being used elsewhere. `HADOOP_VERSION` is arguably a pretty common name, appearing for example in:

- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid unexpected conflicts with environment variables set by other software.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable, but the variable has not been released yet, so there is no user-facing change.

### How was this patch tested?

Existing unit tests cover this.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This commit is contained in:
parent 356fdc9a7f
commit 329850c667
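The conflict this rename guards against can be sketched in a few lines: a generic name like `HADOOP_VERSION` may already be exported by unrelated tooling on the same machine, while a PySpark-prefixed name is effectively namespaced. The version values below are illustrative only.

```python
import os

# Illustrative only: pretend some unrelated tool on the machine already
# exports the generic name for its own purposes.
os.environ["HADOOP_VERSION"] = "3.1.0"
os.environ.pop("PYSPARK_HADOOP_VERSION", None)

# Before this change, the PySpark installer read the generic name and would
# silently pick up the unrelated value instead of its own default:
legacy = os.environ.get("HADOOP_VERSION", "3.2")           # "3.1.0" (unintended)

# After the rename, only the PySpark-prefixed variable is consulted, so the
# unrelated export no longer leaks into the installation:
current = os.environ.get("PYSPARK_HADOOP_VERSION", "3.2")  # "3.2" (default)
```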
```diff
@@ -48,11 +48,11 @@ If you want to install extra dependencies for a specific component, you can inst
 
     pip install pyspark[sql]
 
-For PySpark with/without a specific Hadoop version, you can install it by using ``HADOOP_VERSION`` environment variables as below:
+For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
 
 .. code-block:: bash
 
-    HADOOP_VERSION=2.7 pip install pyspark
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
 
 The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically
 downloads a different version and use it in PySpark. Downloading it can take a while depending on
```
```diff
@@ -60,15 +60,15 @@ the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manu
 
 .. code-block:: bash
 
-    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install
+    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install
 
 It is recommended to use ``-v`` option in ``pip`` to track the installation and download status.
 
 .. code-block:: bash
 
-    HADOOP_VERSION=2.7 pip install pyspark -v
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v
 
-Supported values in ``HADOOP_VERSION`` are:
+Supported values in ``PYSPARK_HADOOP_VERSION`` are:
 
 - ``without``: Spark pre-built with user-provided Apache Hadoop
 - ``2.7``: Spark pre-built for Apache Hadoop 2.7
```
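The documented value set suggests a simple validation step at install time. A hedged sketch of that lookup is below; `SUPPORTED_HADOOP`, `DEFAULT_HADOOP`, and `resolve_hadoop_version` are illustrative names, not the actual setup.py identifiers, and `"3.2"` is assumed to be accepted because the default distribution uses Hadoop 3.2.

```python
import os

# Illustrative names, not the real setup.py identifiers. "3.2" is assumed
# to be a valid value since the default distribution ships Hadoop 3.2.
SUPPORTED_HADOOP = {"without", "2.7", "3.2"}
DEFAULT_HADOOP = "3.2"

def resolve_hadoop_version():
    """Read PYSPARK_HADOOP_VERSION (falling back to the default, as the docs
    describe) and reject values outside the documented set."""
    value = os.environ.get("PYSPARK_HADOOP_VERSION", DEFAULT_HADOOP).lower()
    if value not in SUPPORTED_HADOOP:
        raise ValueError(
            "Unsupported PYSPARK_HADOOP_VERSION %r; supported values: %s"
            % (value, ", ".join(sorted(SUPPORTED_HADOOP))))
    return value
```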
```diff
@@ -36,7 +36,7 @@ def _find_spark_home():
             (os.path.isdir(os.path.join(path, "jars")) or
              os.path.isdir(os.path.join(path, "assembly"))))
 
-    # Spark distribution can be downloaded when HADOOP_VERSION environment variable is set.
+    # Spark distribution can be downloaded when PYSPARK_HADOOP_VERSION environment variable is set.
     # We should look up this directory first, see also SPARK-32017.
     spark_dist_dir = "spark-distribution"
     paths = [
```
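The comment fixed in `_find_spark_home` encodes a lookup order: a Hadoop-specific distribution downloaded at pip-install time lives under `spark-distribution` and must be found before the bundled default. A minimal sketch of that ordering, with an illustrative helper name and path:

```python
import os

# Minimal sketch of the lookup order referenced by SPARK-32017: a distribution
# downloaded at pip-install time lands in "spark-distribution" and must be
# checked before the regular install location. The helper name is illustrative.
def candidate_spark_homes(module_home):
    spark_dist_dir = "spark-distribution"
    return [
        os.path.join(module_home, spark_dist_dir),  # downloaded distro wins
        module_home,                                # then the regular install
    ]
```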
```diff
@@ -125,16 +125,16 @@ class InstallCommand(install):
         spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
         rmtree(spark_dist, ignore_errors=True)
 
-        if ("HADOOP_VERSION" in os.environ) or ("HIVE_VERSION" in os.environ):
-            # Note that SPARK_VERSION environment is just a testing purpose.
-            # HIVE_VERSION environment variable is also internal for now in case
+        if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
+            # Note that PYSPARK_VERSION environment is just a testing purpose.
+            # PYSPARK_HIVE_VERSION environment variable is also internal for now in case
             # we support another version of Hive in the future.
             spark_version, hadoop_version, hive_version = install_module.checked_versions(
-                os.environ.get("SPARK_VERSION", VERSION).lower(),
-                os.environ.get("HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
-                os.environ.get("HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
+                os.environ.get("PYSPARK_VERSION", VERSION).lower(),
+                os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
+                os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
 
-            if ("SPARK_VERSION" not in os.environ and
+            if ("PYSPARK_VERSION" not in os.environ and
                     ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                      (hadoop_version, hive_version))):
                 # Do not download and install if they are same as default.
```
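The `InstallCommand` logic above boils down to: read the prefixed variables with their defaults, then skip the extra download when everything matches the default distribution. A simplified, hedged sketch of that decision (the default values and the helper name are stand-ins, not the real setup.py code):

```python
import os

# Simplified, hedged sketch of the InstallCommand decision in the diff above;
# the defaults and the function name are illustrative stand-ins.
DEFAULT_HADOOP, DEFAULT_HIVE = "3.2", "2.3"

def should_download_distribution():
    hadoop = os.environ.get("PYSPARK_HADOOP_VERSION", DEFAULT_HADOOP).lower()
    hive = os.environ.get("PYSPARK_HIVE_VERSION", DEFAULT_HIVE).lower()
    # Do not download and install if the requested versions equal the default.
    if ("PYSPARK_VERSION" not in os.environ and
            (hadoop, hive) == (DEFAULT_HADOOP, DEFAULT_HIVE)):
        return False
    return True
```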