#!/usr/bin/env python3
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import importlib.util
import glob
import os
import sys

from setuptools import setup
from setuptools.command.install import install
from shutil import copyfile, copytree, rmtree

try:
    exec(open('pyspark/version.py').read())
except IOError:
    print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",
          file=sys.stderr)
    sys.exit(-1)
try:
    spec = importlib.util.spec_from_file_location("install", "pyspark/install.py")
    install_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(install_module)
except IOError:
    print("Failed to load the installing module (pyspark/install.py) which had to be "
          "packaged together.",
          file=sys.stderr)
    sys.exit(-1)
VERSION = __version__  # noqa
# A temporary path so we can access above the Python project root and fetch scripts and jars we need
TEMP_PATH = "deps"
SPARK_HOME = os.path.abspath("../")

# Provide guidance about how to use setup.py
incorrect_invocation_message = """
If you are installing pyspark from spark source, you must first build Spark and
run sdist.

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package
    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz"""

# Figure out where the jars we need to package with PySpark are.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))

if len(JARS_PATH) == 1:
    JARS_PATH = JARS_PATH[0]
elif (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
    # Release mode puts the jars in a jars directory
    JARS_PATH = os.path.join(SPARK_HOME, "jars")
elif len(JARS_PATH) > 1:
    print("Assembly jars exist for multiple Scala versions ({0}), please clean up "
          "assembly/target".format(JARS_PATH), file=sys.stderr)
    sys.exit(-1)
elif len(JARS_PATH) == 0 and not os.path.exists(TEMP_PATH):
    print(incorrect_invocation_message, file=sys.stderr)
    sys.exit(-1)

EXAMPLES_PATH = os.path.join(SPARK_HOME, "examples/src/main/python")
SCRIPTS_PATH = os.path.join(SPARK_HOME, "bin")
USER_SCRIPTS_PATH = os.path.join(SPARK_HOME, "sbin")
DATA_PATH = os.path.join(SPARK_HOME, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")

SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
USER_SCRIPTS_TARGET = os.path.join(TEMP_PATH, "sbin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")

# Check and see if we are under the spark path, in which case we need to build the symlink farm.
# This is important because we only want to build the symlink farm while under Spark; otherwise we
# want to use the symlink farm. And if the symlink farm already exists while we are under Spark
# (e.g. a partially built sdist) we should error and have the user sort it out.
in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
            (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1))


def _supports_symlinks():
    """Check if the system supports symlinks (e.g. *nix) or not."""
    return getattr(os, "symlink", None) is not None


if (in_spark):
    # Construct links for setup
    try:
        os.mkdir(TEMP_PATH)
    except:
        print("Temp path for symlink to parent already exists {0}".format(TEMP_PATH),
              file=sys.stderr)
        sys.exit(-1)

# If you are changing the versions here, please also change ./python/pyspark/sql/pandas/utils.py
# For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
# binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
# Also don't forget to update python/docs/source/getting_started/install.rst.
_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "1.0.0"

class InstallCommand(install):
    # TODO(SPARK-32837) leverage pip's custom options

    def run(self):
        install.run(self)

        # Make sure the destination is always clean.
        spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
        rmtree(spark_dist, ignore_errors=True)

        if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
            # Note that the PYSPARK_VERSION environment variable is for testing purposes only.
            # The PYSPARK_HIVE_VERSION environment variable is also internal for now, in case
            # we support another version of Hive in the future.
            spark_version, hadoop_version, hive_version = install_module.checked_versions(
                os.environ.get("PYSPARK_VERSION", VERSION).lower(),
                os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
                os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
            if ("PYSPARK_VERSION" not in os.environ and
|
[SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?
This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.
**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:
```bash
pip install pyspark --install-option="hadoop3.2"
```
This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.
It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.
- In pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging_ in order to prevent any behaviour changes.
Once this experimental way is proven to be safe, we can stop packing the relevant jars together (keeping only the relevant Python scripts) and download the Spark distribution instead, as this PR proposes.
- This approach is roughly consistent with SparkR:
SparkR provides a method `SparkR::install.spark` to support CRAN installation. This works because SparkR is provided purely as an R library; for example, the `sparkr` script is not packed together.
PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g. the `pyspark` shell.
If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.
- There appears to be no way to publish a release containing a different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so that is not an option.
Given my investigation, the usual approaches are either the `--install-option` hack above or environment variables.
### Why are the changes needed?
To provide users the options to select Hadoop and Hive versions.
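The environment-variable selection described above can be sketched in a few lines. This is a minimal illustration only: the `DEFAULT_HADOOP`/`DEFAULT_HIVE` values here are placeholders, not the authoritative defaults (those live in the `pyspark.install` module).

```python
import os

# Illustrative defaults only; the real values live in pyspark.install.
DEFAULT_HADOOP = "3.2"
DEFAULT_HIVE = "2.3"

def selected_versions():
    """Read the requested Hadoop/Hive versions, falling back to the defaults."""
    hadoop = os.environ.get("HADOOP_VERSION", DEFAULT_HADOOP).lower()
    hive = os.environ.get("HIVE_VERSION", DEFAULT_HIVE).lower()
    return hadoop, hive
```

With no environment variables set, the defaults are returned; setting `HADOOP_VERSION=2.7` before `pip install` changes only the Hadoop component.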
                    ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                        (hadoop_version, hive_version))):
                # Do not download and install if they are same as default.
                return

            install_module.install_spark(
                dest=spark_dist,
                spark_version=spark_version,
                hadoop_version=hadoop_version,
                hive_version=hive_version)
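`install_spark` downloads a Spark distribution and unpacks it under the installed package; the download site can be overridden via `PYSPARK_RELEASE_MIRROR`, as the commit message above notes. A minimal sketch of the URL construction, assuming the standard Apache archive layout — `DEFAULT_SITE` and `release_url` are illustrative names, not the real API:

```python
import os

DEFAULT_SITE = "https://archive.apache.org/dist"  # assumed default for illustration

def release_url(spark_version, hadoop_version):
    """Build the distribution tarball URL, honoring the PYSPARK_RELEASE_MIRROR override."""
    site = os.environ.get("PYSPARK_RELEASE_MIRROR", DEFAULT_SITE)
    return "%s/spark/spark-%s/spark-%s-bin-hadoop%s.tgz" % (
        site, spark_version, spark_version, hadoop_version)
```

Pointing `PYSPARK_RELEASE_MIRROR` at a nearby mirror avoids pulling the multi-hundred-megabyte tarball from the default site.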
try:
    # We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts
    # find it where expected. The rest of the files aren't copied because they are accessed
    # using Python imports instead which will be resolved correctly.
    try:
        os.makedirs("pyspark/python/pyspark")
    except OSError:
        # Don't worry if the directory already exists.
        pass
    copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
    if in_spark:
        # Construct the symlink farm - this is necessary since we can't refer to the path above the
        # package root and we need to copy the jars and scripts which are up above the python root.
        if _supports_symlinks():
            os.symlink(JARS_PATH, JARS_TARGET)
            os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
            os.symlink(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
            os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
            os.symlink(DATA_PATH, DATA_TARGET)
            os.symlink(LICENSES_PATH, LICENSES_TARGET)
        else:
            # For Windows fall back to the slower copytree
            copytree(JARS_PATH, JARS_TARGET)
            copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
            copytree(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
            copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
            copytree(DATA_PATH, DATA_TARGET)
            copytree(LICENSES_PATH, LICENSES_TARGET)
    else:
        # If we are not inside of SPARK_HOME verify we have the required symlink farm
        if not os.path.exists(JARS_TARGET):
            print("To build packaging must be in the python directory under the SPARK_HOME.",
                  file=sys.stderr)

    if not os.path.isdir(SCRIPTS_TARGET):
        print(incorrect_invocation_message, file=sys.stderr)
        sys.exit(-1)
    # Scripts directive requires a list of each script path and does not take wild cards.
    script_names = os.listdir(SCRIPTS_TARGET)
    scripts = list(map(lambda script: os.path.join(SCRIPTS_TARGET, script), script_names))
    # We add find_spark_home.py to the bin directory we install so that pip installed PySpark
    # will search for SPARK_HOME with Python.
    scripts.append("pyspark/find_spark_home.py")

    with open('README.md') as f:
        long_description = f.read()
    setup(
        name='pyspark',
        version=VERSION,
        description='Apache Spark Python API',
        long_description=long_description,
        long_description_content_type="text/markdown",
        author='Spark Developers',
        author_email='dev@spark.apache.org',
        url='https://github.com/apache/spark/tree/master/python',
        packages=['pyspark',
[SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
### What changes were proposed in this pull request?
This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)
### Why are the changes needed?
There are many bug fixes. For example, the bug described in the JIRA:
dill unpickling fails because cloudpickle defines `types.ClassType`, which is undefined in dill. This results in the following error:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```
See also https://github.com/cloudpipe/cloudpickle/issues/82. This was fixed for cloudpickle 1.3.0+ (https://github.com/cloudpipe/cloudpickle/pull/337), but PySpark's cloudpickle.py doesn't have this change yet.
More notably, it now supports the C pickle implementation with Python 3.8, which hugely improves performance. This is already adopted in other projects such as Ray.
### Does this PR introduce _any_ user-facing change?
Yes, the bug fixes described above. Internally, users can also leverage the fast cloudpickle backed by the C pickle implementation.
### How was this patch tested?
Jenkins will test it out.
Closes #29114 from HyukjinKwon/SPARK-32094.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-16 22:49:18 -04:00
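As background for why Spark bundles cloudpickle at all: the standard-library pickler cannot serialize lambdas or interactively defined functions, which Spark must ship to executors. A stdlib-only illustration of the limitation:

```python
import pickle

square = lambda x: x * x

# Plain pickle refuses functions that aren't importable by qualified name,
# raising PicklingError (or AttributeError for locally defined functions).
try:
    pickle.dumps(square)
    failed = False
except (pickle.PicklingError, AttributeError):
    failed = True
```

cloudpickle serializes such functions by value (capturing code objects and closures) rather than by reference, which is why PySpark ships its own copy.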
                  'pyspark.cloudpickle',
                  'pyspark.mllib',
                  'pyspark.mllib.linalg',
                  'pyspark.mllib.stat',
                  'pyspark.ml',
                  'pyspark.ml.linalg',
                  'pyspark.ml.param',
                  'pyspark.sql',
[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?
This PR proposes to move pandas related functionalities into a `pandas` sub-package. Namely:
```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py # Conversion between pandas <> PySpark DataFrames
├── functions.py # pandas_udf
├── group_ops.py # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py # Type utils between pandas <> PyArrow
└── utils.py # Version requirement checks
```
In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under the `pandas` sub-package, I had to use a mix-in approach, which the Scala side often uses via `trait`s; pandas itself also uses this approach to group related functionalities (see `IndexOpsMixin` as an example). Currently, you can think of it as Scala's self-typed trait. See the structure below:
```python
class PandasMapOpsMixin(object):
    def mapInPandas(self, ...):
        ...
        return ...
    # other Pandas <> PySpark APIs
```
```python
class DataFrame(PandasMapOpsMixin):
    # other DataFrame APIs equivalent to Scala side.
```
Yes, this is a big PR, but it is mostly just moving code around, except for one case, `createDataFrame`, whose methods I had to split.
### Why are the changes needed?
There are pandas functionalities scattered here and there, and I myself get lost tracking where each one lives. Also, making a change common to all pandas-related features is almost impossible now.
Also, after this change, `DataFrame` and `SparkSession` become more consistent with the Scala side, since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover this. Also, I manually built the PySpark API documentation and checked.
Closes #27109 from HyukjinKwon/pandas-refactoring.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-08 20:22:50 -05:00
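The mix-in pattern described above can be demonstrated with a toy, self-contained version; the class and method names here only mirror the real ones for illustration:

```python
class PandasMapOpsMixin:
    # The mix-in assumes the host class provides `self.rows`
    # (the "self-typed trait" idea: methods rely on the host's attributes).
    def map_in_pandas(self, func):
        return [func(row) for row in self.rows]

class DataFrame(PandasMapOpsMixin):
    # Other DataFrame APIs live here; pandas-specific ones come from the mix-in.
    def __init__(self, rows):
        self.rows = rows
```

`DataFrame([1, 2, 3]).map_in_pandas(lambda x: x * 10)` returns `[10, 20, 30]`: the method is defined in the mix-in module but behaves as if it were defined on `DataFrame` itself, which is exactly how the refactoring keeps pandas code in one place without changing user-visible behavior.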
                  'pyspark.sql.avro',
                  'pyspark.sql.pandas',
                  'pyspark.streaming',
                  'pyspark.bin',
                  'pyspark.sbin',
                  'pyspark.jars',
                  'pyspark.pandas',
[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
### What changes were proposed in this pull request?
This PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures:
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`
`DataTypeOps` and its subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to specialize behavior per data type.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under `pyspark.pandas.tests.data_type_ops`, one test file per `DataTypeOps` subclass.
Closes #32596 from xinrong-databricks/datatypeop_arith_fix.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-19 22:47:00 -04:00
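A toy version of the data-type-based dispatch; the names are modeled on, but not identical to, the real `DataTypeOps` hierarchy:

```python
class DataTypeOps:
    """Base class: one subclass per data type owns that type's operators."""
    def add(self, left, right):
        raise TypeError("addition not supported for this data type")

class NumericOps(DataTypeOps):
    def add(self, left, right):
        return left + right

class StringOps(DataTypeOps):
    def add(self, left, right):
        # String "addition" is concatenation; a string-specific rule
        # lives in the string subclass instead of one giant function.
        return left + right

def ops_for(value):
    # Minimal dispatch on the runtime type of the operand.
    if isinstance(value, (int, float)):
        return NumericOps()
    if isinstance(value, str):
        return StringOps()
    return DataTypeOps()
```

The payoff is extensibility: changing how one data type handles `+` touches only that type's subclass, which is the motivation stated above.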
                  'pyspark.pandas.data_type_ops',
                  'pyspark.pandas.indexes',
                  'pyspark.pandas.missing',
                  'pyspark.pandas.plot',
                  'pyspark.pandas.spark',
                  'pyspark.pandas.typedef',
                  'pyspark.pandas.usage_logging',
                  'pyspark.python.pyspark',
                  'pyspark.python.lib',
                  'pyspark.data',
                  'pyspark.licenses',
                  'pyspark.resource',
                  'pyspark.examples.src.main.python'],
        include_package_data=True,
        package_dir={
            'pyspark.jars': 'deps/jars',
            'pyspark.bin': 'deps/bin',
            'pyspark.sbin': 'deps/sbin',
            'pyspark.python.lib': 'lib',
            'pyspark.data': 'deps/data',
            'pyspark.licenses': 'deps/licenses',
            'pyspark.examples.src.main.python': 'deps/examples',
        },
        package_data={
            'pyspark.jars': ['*.jar'],
            'pyspark.bin': ['*'],
            'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                             'start-history-server.sh',
                             'stop-history-server.sh'],
            'pyspark.python.lib': ['*.zip'],
            'pyspark.data': ['*.txt', '*.data'],
            'pyspark.licenses': ['*.txt'],
            'pyspark.examples.src.main.python': ['*.py', '*/*.py']},
        scripts=scripts,
        license='http://www.apache.org/licenses/LICENSE-2.0',
        # Don't forget to update python/docs/source/getting_started/install.rst
        # if you're updating the versions or dependencies.
        install_requires=['py4j==0.10.9.2'],
        extras_require={
            'ml': ['numpy>=1.7'],
            'mllib': ['numpy>=1.7'],
[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)
## What changes were proposed in this pull request?
This PR proposes to explicitly specify the Pandas and PyArrow versions in PySpark tests, so tests are skipped or run depending on the installed versions.
We declared the extra dependencies:
https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
In case of PyArrow:
Currently we only check whether pyarrow is installed, without checking its version, and tests already fail as a result. For example, if PyArrow 0.7.0 is installed:
```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.
----------------------------------------------------------------------
Ran 33 tests in 8.098s
FAILED (errors=33)
```
In case of Pandas:
There are a few tests for old Pandas that previously ran only when the Pandas version was lower; I rewrote them so they are exercised both when the Pandas version is lower and when Pandas is missing.
## How was this patch tested?
Manually tested by modifying the condition:
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.
2018-02-07 09:28:10 -05:00
            'sql': [
                'pandas>=%s' % _minimum_pandas_version,
                'pyarrow>=%s' % _minimum_pyarrow_version,
            ],
            'pandas_on_spark': [
                'pandas>=%s' % _minimum_pandas_version,
                'pyarrow>=%s' % _minimum_pyarrow_version,
                'numpy>=1.14,<1.20.0',
            ],
        },
        python_requires='>=3.6',
        classifiers=[
            'Development Status :: 5 - Production/Stable',
            'License :: OSI Approved :: Apache Software License',
            'Programming Language :: Python :: 3.6',
            'Programming Language :: Python :: 3.7',
            'Programming Language :: Python :: 3.8',
            'Programming Language :: Python :: 3.9',
            'Programming Language :: Python :: Implementation :: CPython',
            'Programming Language :: Python :: Implementation :: PyPy',
            'Typing :: Typed'],
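Extras declared in `extras_require` are opt-in: users run e.g. `pip install pyspark[sql]` to pull in pandas and pyarrow. A common runtime counterpart is to point users at the right extra when an optional import is missing. This is an illustrative pattern only — `require_extra` is not part of PySpark's API:

```python
def require_extra(module_name, extra):
    """Import an optional dependency, or fail with the pip extra that provides it."""
    try:
        return __import__(module_name)
    except ImportError:
        raise ImportError(
            "%s is required for this feature; install it with "
            "'pip install pyspark[%s]'." % (module_name, extra))
```

This keeps the base install lightweight while giving users an actionable error instead of a bare `ModuleNotFoundError`.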
        cmdclass={
            'install': InstallCommand,
        },
    )
finally:
    # We only clean up the symlink farm if we were in Spark, otherwise we are installing rather than
    # packaging.
    if in_spark:
        # Clean up either the symlink farm or the copied version, depending on platform support.
        if _supports_symlinks():
            os.remove(os.path.join(TEMP_PATH, "jars"))
            os.remove(os.path.join(TEMP_PATH, "bin"))
            os.remove(os.path.join(TEMP_PATH, "sbin"))
            os.remove(os.path.join(TEMP_PATH, "examples"))
            os.remove(os.path.join(TEMP_PATH, "data"))
            os.remove(os.path.join(TEMP_PATH, "licenses"))
        else:
            rmtree(os.path.join(TEMP_PATH, "jars"))
            rmtree(os.path.join(TEMP_PATH, "bin"))
            rmtree(os.path.join(TEMP_PATH, "sbin"))
            rmtree(os.path.join(TEMP_PATH, "examples"))
            rmtree(os.path.join(TEMP_PATH, "data"))
            rmtree(os.path.join(TEMP_PATH, "licenses"))
        os.rmdir(TEMP_PATH)