[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?
This PR proposes to move pandas-related functionality into a `pandas` sub-package. Namely:
```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py # Conversion between pandas <> PySpark DataFrames
├── functions.py # pandas_udf
├── group_ops.py # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py # Type utils between pandas <> PyArrow
└── utils.py # Version requirement checks
```
In order to locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` separately under the `pandas` sub-package, I had to use a mix-in approach, which the Scala side often uses via `trait`s; pandas itself also uses this approach to group related functionality (see `IndexOpsMixin` as an example). Currently, you can think of it as something like Scala's self-typed trait. See the structure below:
```python
class PandasMapOpsMixin(object):
def mapInPandas(self, ...):
...
return ...
# other Pandas <> PySpark APIs
```
```python
class DataFrame(PandasMapOpsMixin):
# other DataFrame APIs equivalent to Scala side.
```
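The sketches above intentionally elide the method bodies. A minimal, self-contained version of the same mixin pattern looks like the following; the bodies here are illustrative stand-ins (a plain list of rows instead of a real distributed DataFrame), not the actual PySpark implementations:

```python
class PandasMapOpsMixin(object):
    # Groups the pandas-specific API; meant to be mixed into DataFrame only.
    def mapInPandas(self, func):
        # In the real API this applies `func` over iterators of pandas
        # DataFrames; here we simply apply it to the wrapped rows.
        assert isinstance(self, DataFrame)  # "self-typed": only valid on DataFrame
        return DataFrame(func(self.rows))


class DataFrame(PandasMapOpsMixin):
    # The main class picks up the pandas-specific API via the mixin.
    def __init__(self, rows):
        self.rows = rows


df = DataFrame([1, 2, 3])
result = df.mapInPandas(lambda rows: [r * 2 for r in rows])
print(result.rows)  # [2, 4, 6]
```

The assertion mirrors how a self-typed trait constrains the mixin: the pandas methods live in a separate module, but can only be called on the main `DataFrame` class.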
Yes, this is a big PR, but it is mostly just moving code around. The one exception is `createDataFrame`, whose methods I had to split.
### Why are the changes needed?
pandas functionality is currently scattered here and there, and I myself get lost finding where a given piece lives. Also, making a change that applies commonly to all pandas-related features is almost impossible now.
Also, after this change, `DataFrame` and `SparkSession` become more consistent with the Scala side, since pandas is specific to Python; this change separates the pandas-specific APIs away from `DataFrame` and `SparkSession`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover this. Also, I manually built the PySpark API documentation and checked it.
Closes #27109 from HyukjinKwon/pandas-refactoring.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-08 20:22:50 -05:00
```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


def require_minimum_pandas_version():
    """ Raise ImportError if minimum version of Pandas is not installed
    """
    # TODO(HyukjinKwon): Relocate and deduplicate the version specification.
    minimum_pandas_version = "0.23.2"

    from distutils.version import LooseVersion
    try:
        import pandas
        have_pandas = True
    except ImportError:
        have_pandas = False
    if not have_pandas:
        raise ImportError("Pandas >= %s must be installed; however, "
                          "it was not found." % minimum_pandas_version)
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
        raise ImportError("Pandas >= %s must be installed; however, "
                          "your version was %s." % (minimum_pandas_version,
                                                    pandas.__version__))


def require_minimum_pyarrow_version():
    """ Raise ImportError if minimum version of pyarrow is not installed
    """
    # TODO(HyukjinKwon): Relocate and deduplicate the version specification.
    minimum_pyarrow_version = "1.0.0"

    from distutils.version import LooseVersion
    import os
    try:
        import pyarrow
        have_arrow = True
    except ImportError:
        have_arrow = False
    if not have_arrow:
        raise ImportError("PyArrow >= %s must be installed; however, "
                          "it was not found." % minimum_pyarrow_version)
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
        raise ImportError("PyArrow >= %s must be installed; however, "
                          "your version was %s." % (minimum_pyarrow_version,
                                                    pyarrow.__version__))
    if os.environ.get("ARROW_PRE_0_15_IPC_FORMAT", "0") == "1":
        raise RuntimeError("Arrow legacy IPC format is not supported in PySpark, "
                           "please unset ARROW_PRE_0_15_IPC_FORMAT")
```
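Both helpers above share the same version-gate pattern: parse the installed version, compare it against the minimum, and raise `ImportError` with a message naming both versions. A standalone sketch of that pattern follows, with a hypothetical `require_minimum_version` helper and a simplified stand-in for `LooseVersion` (so the sketch does not depend on the now-deprecated `distutils`):

```python
def parse_version(version):
    # Simplified stand-in for LooseVersion: "1.0.0" -> (1, 0, 0).
    # Tuples of ints compare component-wise, which is enough for
    # plain dotted numeric versions like the ones used here.
    return tuple(int(part) for part in version.split("."))


def require_minimum_version(package, installed, minimum):
    # Hypothetical helper mirroring the checks above: raise ImportError
    # when the installed version is older than the required minimum.
    if parse_version(installed) < parse_version(minimum):
        raise ImportError("%s >= %s must be installed; however, "
                          "your version was %s." % (package, minimum, installed))


require_minimum_version("Pandas", "1.2.5", "0.23.2")  # new enough, passes silently
try:
    require_minimum_version("PyArrow", "0.15.1", "1.0.0")
except ImportError as e:
    print(e)  # PyArrow >= 1.0.0 must be installed; however, your version was 0.15.1
```

Raising `ImportError` (rather than, say, `RuntimeError`) keeps the failure mode consistent whether the package is missing entirely or merely too old.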