[SPARK-29672][BUILD][PYTHON][FOLLOW-UP] Recover PySpark via pip installation with deprecated Python 2, 3.4 and 3.5
### What changes were proposed in this pull request?
The v3.0.0 RC1 fails to install against Python 2.7 via `pip`. Python 2, 3.4 and 3.5 support is deprecated but has not been removed yet, so this PR partially reverts the changes from SPARK-29672 to restore pip installation for Python 2, 3.4 and 3.5.
```bash
python2.7 -m pip install https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz
```
```
...
Collecting https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz
Using cached https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz (203.0 MB)
ERROR: Command errored out with exit status 1:
command: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py'"'"'; __file__='"'"'/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/pip-egg-info
cwd: /private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/
Complete output (6 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py", line 27
file=sys.stderr)
^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
```
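The failure comes from `print(..., file=sys.stderr)` near the top of `setup.py`: without `from __future__ import print_function`, `print` is a statement in Python 2, so the `file=` keyword argument is a syntax error at parse time. A minimal sketch of the Python 2/3-compatible pattern this revert keeps in `setup.py`:

```python
# The __future__ import turns `print` into a function, so the `file=` keyword
# argument parses on Python 2.7 as well as Python 3.
from __future__ import print_function
import sys

print("Python versions prior to 2.7 are not supported for pip installed PySpark.",
      file=sys.stderr)
```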
### Why are the changes needed?
To keep the deprecated support instead of removing it.
### Does this PR introduce any user-facing change?
No, the change is only in branches that have not been released yet.
### How was this patch tested?
```bash
./build/mvn -DskipTests -Phive -Phive-thriftserver clean package
cd python
python2.7 setup.py sdist
python2.7 -m pip install dist/pyspark-3.1.0.dev0.tar.gz
```
Closes #28243 from HyukjinKwon/SPARK-29672-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-19 21:31:44 -04:00
#!/usr/bin/env python

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function

import glob
import os
import sys
from setuptools import setup
from shutil import copyfile, copytree, rmtree
if sys.version_info < (2, 7):
    print("Python versions prior to 2.7 are not supported for pip installed PySpark.",
          file=sys.stderr)
    sys.exit(-1)

try:
    exec(open('pyspark/version.py').read())
except IOError:
    print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",
          file=sys.stderr)
    sys.exit(-1)
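# __version__ is brought into scope by exec'ing pyspark/version.py above.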
VERSION = __version__ # noqa
# A temporary path so we can access above the Python project root and fetch scripts and jars we need
TEMP_PATH = "deps"
SPARK_HOME = os.path.abspath("../")

# Provide guidance about how to use setup.py
incorrect_invocation_message = """
If you are installing pyspark from spark source, you must first build Spark and
run sdist.

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package
    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz"""

# Figure out where the jars are we need to package with PySpark.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))

if len(JARS_PATH) == 1:
    JARS_PATH = JARS_PATH[0]
elif (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
    # Release mode puts the jars in a jars directory
    JARS_PATH = os.path.join(SPARK_HOME, "jars")
elif len(JARS_PATH) > 1:
    print("Assembly jars exist for multiple scalas ({0}), please cleanup assembly/target".format(
        JARS_PATH), file=sys.stderr)
    sys.exit(-1)
elif len(JARS_PATH) == 0 and not os.path.exists(TEMP_PATH):
    print(incorrect_invocation_message, file=sys.stderr)
    sys.exit(-1)

EXAMPLES_PATH = os.path.join(SPARK_HOME, "examples/src/main/python")
SCRIPTS_PATH = os.path.join(SPARK_HOME, "bin")
USER_SCRIPTS_PATH = os.path.join(SPARK_HOME, "sbin")
DATA_PATH = os.path.join(SPARK_HOME, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")

SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
USER_SCRIPTS_TARGET = os.path.join(TEMP_PATH, "sbin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")

# Check and see if we are under the spark path in which case we need to build the symlink farm.
# This is important because we only want to build the symlink farm while under Spark; otherwise we
# want to use the symlink farm. And if the symlink farm exists while under Spark (e.g. a
# partially built sdist) we should error and have the user sort it out.
in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
            (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1))


def _supports_symlinks():
    """Check if the system supports symlinks (e.g. *nix) or not."""
    return getattr(os, "symlink", None) is not None


if (in_spark):
    # Construct links for setup
    try:
        os.mkdir(TEMP_PATH)
    except:
        print("Temp path for symlink to parent already exists {0}".format(TEMP_PATH),
              file=sys.stderr)
        sys.exit(-1)

# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py
# For Arrow, you should also check ./pom.xml and ensure there are no breaking changes in the
# binary format protocol with the Java version, see ARROW_HOME/format/* for specifications.
_minimum_pandas_version = "0.23.2"
_minimum_pyarrow_version = "0.15.1"

try:
    # We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts
    # find it where expected. The rest of the files aren't copied because they are accessed
    # using Python imports instead which will be resolved correctly.
    try:
        os.makedirs("pyspark/python/pyspark")
    except OSError:
        # Don't worry if the directory already exists.
        pass
    copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")

    if (in_spark):
        # Construct the symlink farm - this is necessary since we can't refer to the path above the
        # package root and we need to copy the jars and scripts which are up above the python root.
        if _supports_symlinks():
            os.symlink(JARS_PATH, JARS_TARGET)
            os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
            os.symlink(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
            os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
            os.symlink(DATA_PATH, DATA_TARGET)
            os.symlink(LICENSES_PATH, LICENSES_TARGET)
        else:
            # For windows fall back to the slower copytree
            copytree(JARS_PATH, JARS_TARGET)
            copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
            copytree(USER_SCRIPTS_PATH, USER_SCRIPTS_TARGET)
            copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
            copytree(DATA_PATH, DATA_TARGET)
            copytree(LICENSES_PATH, LICENSES_TARGET)
    else:
        # If we are not inside of SPARK_HOME verify we have the required symlink farm
        if not os.path.exists(JARS_TARGET):
            print("To build packaging must be in the python directory under the SPARK_HOME.",
                  file=sys.stderr)

    if not os.path.isdir(SCRIPTS_TARGET):
        print(incorrect_invocation_message, file=sys.stderr)
        sys.exit(-1)

    # Scripts directive requires a list of each script path and does not take wild cards.
    script_names = os.listdir(SCRIPTS_TARGET)
    scripts = list(map(lambda script: os.path.join(SCRIPTS_TARGET, script), script_names))
    # We add find_spark_home.py to the bin directory we install so that pip installed PySpark
    # will search for SPARK_HOME with Python.
    scripts.append("pyspark/find_spark_home.py")

    with open('README.md') as f:
        long_description = f.read()

    setup(
        name='pyspark',
        version=VERSION,
        description='Apache Spark Python API',
        long_description=long_description,
        long_description_content_type="text/markdown",
        author='Spark Developers',
        author_email='dev@spark.apache.org',
        url='https://github.com/apache/spark/tree/master/python',
        packages=['pyspark',
                  'pyspark.mllib',
                  'pyspark.mllib.linalg',
                  'pyspark.mllib.stat',
                  'pyspark.ml',
                  'pyspark.ml.linalg',
                  'pyspark.ml.param',
                  'pyspark.sql',
                  'pyspark.sql.avro',
                  'pyspark.sql.pandas',
                  'pyspark.streaming',
                  'pyspark.bin',
                  'pyspark.sbin',
                  'pyspark.jars',
                  'pyspark.python.pyspark',
                  'pyspark.python.lib',
                  'pyspark.data',
                  'pyspark.licenses',
                  'pyspark.resource',
                  'pyspark.examples.src.main.python'],
        include_package_data=True,
        package_dir={
            'pyspark.jars': 'deps/jars',
            'pyspark.bin': 'deps/bin',
            'pyspark.sbin': 'deps/sbin',
            'pyspark.python.lib': 'lib',
            'pyspark.data': 'deps/data',
            'pyspark.licenses': 'deps/licenses',
            'pyspark.examples.src.main.python': 'deps/examples',
        },
        package_data={
            'pyspark.jars': ['*.jar'],
            'pyspark.bin': ['*'],
            'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                             'start-history-server.sh',
                             'stop-history-server.sh', ],
            'pyspark.python.lib': ['*.zip'],
            'pyspark.data': ['*.txt', '*.data'],
            'pyspark.licenses': ['*.txt'],
            'pyspark.examples.src.main.python': ['*.py', '*/*.py']},
        scripts=scripts,
        license='http://www.apache.org/licenses/LICENSE-2.0',
        install_requires=['py4j==0.10.9'],
        extras_require={
            'ml': ['numpy>=1.7'],
            'mllib': ['numpy>=1.7'],
            'sql': [
                'pandas>=%s' % _minimum_pandas_version,
                'pyarrow>=%s' % _minimum_pyarrow_version,
            ]
        },
        classifiers=[
            'Development Status :: 5 - Production/Stable',
            'License :: OSI Approved :: Apache Software License',
            'Programming Language :: Python :: 2.7',
            'Programming Language :: Python :: 3',
            'Programming Language :: Python :: 3.4',
            'Programming Language :: Python :: 3.5',
            'Programming Language :: Python :: 3.6',
            'Programming Language :: Python :: 3.7',
            'Programming Language :: Python :: 3.8',
            'Programming Language :: Python :: Implementation :: CPython',
            'Programming Language :: Python :: Implementation :: PyPy']
    )
finally:
    # We only cleanup the symlink farm if we were in Spark, otherwise we are installing rather than
    # packaging.
    if (in_spark):
        # Depending on cleaning up the symlink farm or copied version
        if _supports_symlinks():
            os.remove(os.path.join(TEMP_PATH, "jars"))
            os.remove(os.path.join(TEMP_PATH, "bin"))
            os.remove(os.path.join(TEMP_PATH, "sbin"))
            os.remove(os.path.join(TEMP_PATH, "examples"))
            os.remove(os.path.join(TEMP_PATH, "data"))
            os.remove(os.path.join(TEMP_PATH, "licenses"))
        else:
            rmtree(os.path.join(TEMP_PATH, "jars"))
            rmtree(os.path.join(TEMP_PATH, "bin"))
            rmtree(os.path.join(TEMP_PATH, "sbin"))
            rmtree(os.path.join(TEMP_PATH, "examples"))
            rmtree(os.path.join(TEMP_PATH, "data"))
            rmtree(os.path.join(TEMP_PATH, "licenses"))
        os.rmdir(TEMP_PATH)