spark-instrumented-optimizer/python/setup.py

#!/usr/bin/env python3
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib.util
import glob
import os
import sys
from setuptools import setup
from setuptools.command.install import install
from shutil import copyfile, copytree, rmtree
try:
    exec(open('pyspark/version.py').read())
except IOError:
    print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",
          file=sys.stderr)
    sys.exit(-1)
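# The exec above is expected to define __version__ (from pyspark/version.py); it is
# re-exported as VERSION below and used as the default Spark version to install.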
try:
    spec = importlib.util.spec_from_file_location("install", "pyspark/install.py")
    install_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(install_module)
except IOError:
    print("Failed to load the installing module (pyspark/install.py), which should have been "
          "packaged together.",
          file=sys.stderr)
    sys.exit(-1)
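# install_module (pyspark/install.py) provides DEFAULT_HADOOP, DEFAULT_HIVE,
# checked_versions() and install_spark(), which the InstallCommand class below relies on.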
VERSION = __version__ # noqa
# A temporary path that lets us reach above the Python project root and fetch the scripts and jars we need
TEMP_PATH = "deps"
SPARK_HOME = os.path.abspath("../")
# Provide guidance about how to use setup.py
incorrect_invocation_message = """
If you are installing pyspark from spark source, you must first build Spark and
run sdist.

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package

    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz"""
# Figure out where the jars we need to package with PySpark are located.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
if len(JARS_PATH) == 1:
    JARS_PATH = JARS_PATH[0]
elif (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
    # Release mode puts the jars in a jars directory
    JARS_PATH = os.path.join(SPARK_HOME, "jars")
elif len(JARS_PATH) > 1:
    print("Assembly jars exist for multiple Scala versions ({0}), please clean up "
          "assembly/target".format(JARS_PATH), file=sys.stderr)
    sys.exit(-1)
elif len(JARS_PATH) == 0 and not os.path.exists(TEMP_PATH):
    print(incorrect_invocation_message, file=sys.stderr)
    sys.exit(-1)
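# If none of the branches above matched, JARS_PATH stays an empty list but TEMP_PATH ("deps")
# already exists; this is the sdist case, where the symlink farm was built earlier and the
# jars are presumably picked up from there.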
EXAMPLES_PATH = os.path.join(SPARK_HOME, "examples/src/main/python")
SCRIPTS_PATH = os.path.join(SPARK_HOME, "bin")
USER_SCRIPTS_PATH = os.path.join(SPARK_HOME, "sbin")
DATA_PATH = os.path.join(SPARK_HOME, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")
SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
USER_SCRIPTS_TARGET = os.path.join(TEMP_PATH, "sbin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")
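# All of the *_TARGET paths live under TEMP_PATH ("deps"); they are the in-tree locations
# that the steps below populate (via symlinks where supported) so the scripts, jars,
# examples, data and licenses can be shipped inside the Python package.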
# Check and see if we are under the Spark path, in which case we need to build the symlink farm.
# This matters because we only want to build the symlink farm while under Spark; otherwise we
# want to use the existing symlink farm. And if the symlink farm already exists while we are
# under Spark (e.g. a partially built sdist), we should error out and have the user sort it out.
in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
            (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1))


def _supports_symlinks():
    """Check if the system supports symlinks (e.g. *nix) or not."""
    return getattr(os, "symlink", None) is not None
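# Note: os.symlink may be missing or restricted on some platforms (e.g. certain Windows
# setups); copytree is imported above, presumably as the copy-based fallback used by the
# packaging steps when symlinks are unavailable.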
if (in_spark):
    # Construct links for setup
    try:
        os.mkdir(TEMP_PATH)
    except OSError:
        print("Temp path for symlinks to the parent already exists: {0}".format(TEMP_PATH),
              file=sys.stderr)
        sys.exit(-1)
# If you are changing the versions here, please also change ./python/pyspark/sql/pandas/utils.py
# For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
# binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
# Also don't forget to update python/docs/source/getting_started/install.rst.
_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "1.0.0"
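# These minimums mirror the runtime checks in pyspark/sql/pandas/utils.py (see the note
# above); they are presumably also what the optional dependency specifications declared
# later in this file are pinned against.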
class InstallCommand(install):
    # TODO(SPARK-32837) leverage pip's custom options

    def run(self):
        install.run(self)

        # Make sure the destination is always clean.
        spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
        rmtree(spark_dist, ignore_errors=True)

        if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
            # Note that the PYSPARK_VERSION environment variable is for testing purposes only.
            # The PYSPARK_HIVE_VERSION environment variable is also kept internal for now, in
            # case we support another version of Hive in the future.
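            # Illustrative usage, assuming an sdist built from this directory (the tarball
            # name below is a placeholder, and the accepted version values are whatever
            # install_module.checked_versions() validates):
            #   PYSPARK_HADOOP_VERSION=3.2 pip install pyspark-<version>.tar.gz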
            spark_version, hadoop_version, hive_version = install_module.checked_versions(
                os.environ.get("PYSPARK_VERSION", VERSION).lower(),
                os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
                os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
if ("PYSPARK_VERSION" not in os.environ and
                    ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                        (hadoop_version, hive_version))):
                # Do not download and install if they are the same as the default.
                return

            install_module.install_spark(
                dest=spark_dist,
                spark_version=spark_version,
                hadoop_version=hadoop_version,
                hive_version=hive_version)
try:
    # We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts
    # find it where expected. The rest of the files aren't copied because they are accessed
    # using Python imports instead which will be resolved correctly.
    try:
        os.makedirs("pyspark/python/pyspark")
    except OSError:
        # Don't worry if the directory already exists.
        pass

    copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")

    if (in_spark):
        # Construct the symlink farm - this is necessary since we can't refer to the path above the
        # package root and we need to copy the jars and scripts which are up above the python root.
        if _supports_symlinks():
            os.symlink(JARS_PATH, JARS_TARGET)
            os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
            os.symlink(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
os.symlink(DATA_PATH, DATA_TARGET)
os.symlink(LICENSES_PATH, LICENSES_TARGET)
else:
            # For Windows, fall back to the slower copytree since symlinks may not be available.
copytree(JARS_PATH, JARS_TARGET)
copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
copytree(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
copytree(DATA_PATH, DATA_TARGET)
copytree(LICENSES_PATH, LICENSES_TARGET)
else:
        # If we are not inside SPARK_HOME, verify we have the required symlink farm.
if not os.path.exists(JARS_TARGET):
print("To build packaging must be in the python directory under the SPARK_HOME.",
file=sys.stderr)
if not os.path.isdir(SCRIPTS_TARGET):
print(incorrect_invocation_message, file=sys.stderr)
sys.exit(-1)
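    # At this point the deps/ targets either symlink back into the surrounding Spark build
    # (when packaging from a source checkout) or already contain real copies of the jars,
    # scripts, data and licenses (when installing from a released source distribution).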
    # The scripts directive requires a list of explicit script paths; it does not accept wildcards.
script_names = os.listdir(SCRIPTS_TARGET)
scripts = list(map(lambda script: os.path.join(SCRIPTS_TARGET, script), script_names))
# We add find_spark_home.py to the bin directory we install so that pip installed PySpark
# will search for SPARK_HOME with Python.
scripts.append("pyspark/find_spark_home.py")
with open('README.md') as f:
long_description = f.read()
setup(
name='pyspark',
version=VERSION,
description='Apache Spark Python API',
long_description=long_description,
long_description_content_type="text/markdown",
author='Spark Developers',
author_email='dev@spark.apache.org',
url='https://github.com/apache/spark/tree/master/python',
packages=['pyspark',
'pyspark.cloudpickle',
'pyspark.mllib',
'pyspark.mllib.linalg',
'pyspark.mllib.stat',
'pyspark.ml',
'pyspark.ml.linalg',
'pyspark.ml.param',
'pyspark.sql',
'pyspark.sql.avro',
'pyspark.sql.pandas',
'pyspark.streaming',
'pyspark.bin',
'pyspark.sbin',
'pyspark.jars',
'pyspark.python.pyspark',
'pyspark.python.lib',
'pyspark.data',
'pyspark.licenses',
'pyspark.resource',
'pyspark.examples.src.main.python'],
include_package_data=True,
package_dir={
'pyspark.jars': 'deps/jars',
'pyspark.bin': 'deps/bin',
'pyspark.sbin': 'deps/sbin',
'pyspark.python.lib': 'lib',
'pyspark.data': 'deps/data',
'pyspark.licenses': 'deps/licenses',
'pyspark.examples.src.main.python': 'deps/examples',
},
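        # package_dir points the pyspark.jars/bin/sbin/data/licenses/examples packages at the
        # deps/ farm assembled above, so the JVM jars and launcher scripts ship inside the
        # same pip artifact as the Python code.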
package_data={
'pyspark.jars': ['*.jar'],
'pyspark.bin': ['*'],
'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
'start-history-server.sh',
'stop-history-server.sh', ],
'pyspark.python.lib': ['*.zip'],
'pyspark.data': ['*.txt', '*.data'],
'pyspark.licenses': ['*.txt'],
'pyspark.examples.src.main.python': ['*.py', '*/*.py']},
scripts=scripts,
license='http://www.apache.org/licenses/LICENSE-2.0',
# Don't forget to update python/docs/source/getting_started/install.rst
# if you're updating the versions or dependencies.
install_requires=['py4j==0.10.9.1'],
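        # Note: the same py4j release is also bundled as a zip under python/lib (exposed via
        # the pyspark.python.lib package data above), so this pin is expected to stay in sync
        # with that bundled copy.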
extras_require={
'ml': ['numpy>=1.7'],
'mllib': ['numpy>=1.7'],
'sql': [
'pandas>=%s' % _minimum_pandas_version,
'pyarrow>=%s' % _minimum_pyarrow_version,
]
},
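        # These extras surface as optional pip extras, e.g. `pip install pyspark[sql]` pulls in
        # pandas and pyarrow at the minimum versions required for the pandas/Arrow integration.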
python_requires='>=3.6',
classifiers=[
'Development Status :: 5 - Production/Stable',
'License :: OSI Approved :: Apache Software License',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
            'Programming Language :: Python :: 3.8',
            'Programming Language :: Python :: 3.9',
            'Programming Language :: Python :: Implementation :: CPython',
            'Programming Language :: Python :: Implementation :: PyPy',
            'Typing :: Typed'],
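        # Registering InstallCommand under 'install' makes pip and setuptools run that custom
        # command class instead of the default install step when this package is installed.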
        cmdclass={
            'install': InstallCommand,
        },
    )
finally:
    # We only clean up the symlink farm if we were in Spark; otherwise we are installing
    # rather than packaging.
    if (in_spark):
        # Clean up either the symlink farm or the copied directories, depending on whether
        # symlinks are supported on this platform.
        if _supports_symlinks():
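            # The farm entries are symlinks on this path, so os.remove() deletes only the
            # links and leaves the original Spark directories untouched.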
            os.remove(os.path.join(TEMP_PATH, "jars"))
            os.remove(os.path.join(TEMP_PATH, "bin"))
            os.remove(os.path.join(TEMP_PATH, "sbin"))
            os.remove(os.path.join(TEMP_PATH, "examples"))
            os.remove(os.path.join(TEMP_PATH, "data"))
            os.remove(os.path.join(TEMP_PATH, "licenses"))
        else:
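            # Without symlink support the farm was populated with full copies of these
            # directories, so the copied trees are removed recursively.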
            rmtree(os.path.join(TEMP_PATH, "jars"))
            rmtree(os.path.join(TEMP_PATH, "bin"))
            rmtree(os.path.join(TEMP_PATH, "sbin"))
            rmtree(os.path.join(TEMP_PATH, "examples"))
            rmtree(os.path.join(TEMP_PATH, "data"))
            rmtree(os.path.join(TEMP_PATH, "licenses"))
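        # Everything staged under TEMP_PATH has been removed above, so the now-empty
        # directory itself can be deleted.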
        os.rmdir(TEMP_PATH)