Commit graph

123 commits

Author SHA1 Message Date
Enrico Minack f90eb6a5db [SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark
### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.

### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.

### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.test.test_dataframe`.

Closes #33484 from EnricoMi/branch-observation-python.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 01:39:34 +08:00
Kousuke Saruta c7ccc602db [SPARK-36166][TESTS][FOLLOWUP] Add BLOCK_SCALA_VERSION to sparktestssupport/__init__.py
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestssupport/__init__.py`.

### Why are the changes needed?

The following command fails due to the definition is missing.
```
SCALA_PROFILE=scala2.12 dev/run-tests.py
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The command shown above works.

Closes #33421 from sarutak/followup-SPARK-36166.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 22:47:03 +09:00
Yikun Jiang fdc50f4452 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unittests passed

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:13 +09:00
Takuya UESHIN 1f6e2f55d7 Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"
This reverts commit 5db51efa1a.
2021-06-29 12:08:09 -07:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need some auto-discover way to find all testcase rather than specified every case by manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Dongjoon Hyun 57896e662e [SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for catalyst/sql module changes too
### What changes were proposed in this pull request?

This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally.

### Why are the changes needed?

Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33125 from dongjoon-hyun/SPARK-35483.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:59:56 -07:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions`  is adjusted in order to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Yikun Jiang b7df75a777 [SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps
### What changes were proposed in this pull request?
This patch adds DataTypeOps test to check the ops is loaded as expected.

### Why are the changes needed?
When complete https://github.com/apache/spark/pull/32821, I found there are no test for DataTypeOps. There were many logic when DataTypeOps loaded, it's better to add the test to make sure interface stable.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
test passed.

Closes #32859 from Yikun/SPARK-XXXXX1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 18:54:50 -07:00
Yikun Jiang f84a720fe3 [SPARK-35342][PYTHON] Introduce DecimalOps and make isnull method data-type-based
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based

### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN.

https://issues.apache.org/jira/browse/SPARK-35342

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- New added DecimalOpsTest passed
- Existing NumOpsTest passed

Closes #32821 from Yikun/SPARK-35342.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 10:44:35 -07:00
Xinrong Meng 04a8d2cbcf [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes
### What changes were proposed in this pull request?

Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.

### Why are the changes needed?

The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 13:12:12 -07:00
Takuya UESHIN 221553c204 [SPARK-35642][INFRA] Split pyspark-pandas tests to rebalance the test duration
### What changes were proposed in this pull request?

Splits some tests in `pyspark-pandas` module as slot tests to rebalance the test duration.

Picked the top 12 tests from the previous runs and the total times are almost even.

### Why are the changes needed?

Currently `pyspark-pandas` module tests take long time, so we should rebalance the tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32778 from ueshin/issues/SPARK-35642/split-pandas-on-spark-tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 12:52:52 +09:00
Kousuke Saruta 2de19e460b [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
Once #32631 was merged but there was a lack of consideration.

Diff between this change and 692d95d145 merged in #32631 is as follows.

```
       if: github.repository != 'apache/spark'
       id: sync-branch
       run: |
+        apache_spark_ref=`git rev-parse HEAD`
         git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' merge --no-commit --progress --squash FETCH_HEAD
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' commit -m "Merged commit"
+        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
     - name: Cache Scala, SBT and Maven
       uses: actions/cachev2
       with:
```

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32691 from sarutak/docker-integration-test-ga-take2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 16:54:47 +09:00
Hyukjin Kwon d189cf75f9 Revert "[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA"
This reverts commit 0a74ad66b3.
2021-05-28 14:29:12 +09:00
Kousuke Saruta 0a74ad66b3 [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
`doker-integration-tests` can't run if docker is not installed so it run only if `docker-integration-tests` is specified with `--module`.

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32631 from sarutak/docker-integration-test-ga.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 07:56:37 +09:00
Xinrong Meng 8cc7232ffa [SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType
### What changes were proposed in this pull request?

BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.

### Why are the changes needed?

The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its special way of addition, which means concatenation.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.

```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0    [49, 49]
1    [50, 50]
2    [51, 51]
dtype: object
>>> psser + b'1'
0    [49, 49]
1    [50, 49]
2    [51, 49]
dtype: object
```

### How was this patch tested?

Unit tests.

Closes #32665 from xinrong-databricks/datatypeops_binary.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 14:30:24 -07:00
Xinrong Meng 266608d50e [SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps
### What changes were proposed in this pull request?

The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.

### Why are the changes needed?

StructType, ArrayType, and MapType are not accepted by DataTypeOps now.

We should handle these complex types. Among them:

- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.

- StructOps will be helpful to make to/from pandas conversion data-type-based.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0    [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0    [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```

### How was this patch tested?

Unit tests.

Closes #32626 from xinrong-databricks/datatypeop_complex.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 10:40:01 -07:00
Xinrong Meng a970f8505d [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32596 from xinrong-databricks/datatypeop_arith_fix.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-19 19:47:00 -07:00
Takuya UESHIN d44e6c7f10 Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures"
This reverts commit d1b24d8aba.
2021-05-19 16:49:47 -07:00
Xinrong Meng d1b24d8aba [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32469 from xinrong-databricks/datatypeop_arith.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-19 15:05:32 -07:00
Xinrong Meng 4aee19efb4 [SPARK-35032][PYTHON] Port Koalas Index unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable Index unit tests.

Closes #32139 from xinrong-databricks/port.indexes_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-16 08:53:30 +09:00
xinrong-databricks 58feb85145 [SPARK-35034][PYTHON] Port Koalas miscellaneous unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable miscellaneous unit tests.

Closes #32152 from xinrong-databricks/port.misc_tests.

Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-15 11:45:15 +09:00
Xinrong Meng 47d62af2a9 [SPARK-35035][PYTHON] Port Koalas internal implementation unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable internal implementation unit tests.

Closes #32137 from xinrong-databricks/port.test_internal_impl.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-14 13:59:33 +09:00
Xinrong Meng cd1e8e8158 [SPARK-35033][PYTHON] Port Koalas plot unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable plot unit tests.

Closes #32151 from xinrong-databricks/port.plot_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-14 13:20:16 +09:00
Xinrong Meng 8ebc3fca8c [SPARK-35012][PYTHON] Port Koalas DataFrame-related unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable DataFrame-related unit tests.

Closes #32131 from xinrong-databricks/port.test_dataframe_related.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-04-13 14:24:08 -07:00
Xinrong Meng a392633566 [SPARK-34996][PYTHON] Port Koalas Series-related unit tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series related unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable Series-related unit tests.

Closes #32117 from xinrong-databricks/port.test_series_related.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-13 13:03:35 +09:00
Xinrong Meng 9c1f807549 [SPARK-35031][PYTHON] Port Koalas operations on different frames tests into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable operations on different frames unit tests.

Closes #32133 from xinrong-databricks/port.test_ops_on_diff_frames.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-13 11:22:51 +09:00
Xinrong Meng 3af2c1bb9c [SPARK-34886][PYTHON] Port/integrate Koalas DataFrame unit test into PySpark
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame unit test to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested at all. We should enable the DataFrame unit test first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable the DataFrame unit test.

Closes #32083 from xinrong-databricks/port.test_dataframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-09 15:48:13 +09:00
Takuya UESHIN 2635c3894f [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work
### What changes were proposed in this pull request?

Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure.

### Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled the whole doctests.

Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-07 20:50:41 +09:00
Fokko Driesprong e4d1c10760 [SPARK-32320][PYSPARK] Remove mutable default arguments
This is bad practice, and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/

```
fokkodriesprongFan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
```

```
fokkodriesprongFan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
```

### What changes were proposed in this pull request?

Removing the mutable default arguments.

### Why are the changes needed?

Removing the mutable default arguments, and changing the signature to `Optional[...]`.

### Does this PR introduce _any_ user-facing change?

No 👍

### How was this patch tested?

Using the Flake8 bugbear code analysis plugin.

Closes #29122 from Fokko/SPARK-32320.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-08 09:35:36 +08:00
Weichen Xu 80161238fe [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-01 09:36:42 +08:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts). And downloads the Spark distribution as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as a R library. For example, `sparkr` script is not packed together.

  PySpark cannot take this approach because PySpark packaging ships relevant executable script together, e.g.) `pyspark` shell.

  If PySpark has a method such as `pyspark.install_spark`, users cannot call it in `pyspark` because `pyspark` already assumes relevant Spark is installed, JVM is launched, etc.

- There looks no way to release that contains different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  The usual way looks either `--install-option` above with hacks or environment variables given my investigation.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested in Mac and Windows (after building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
HyukjinKwon b84ed4146d [SPARK-32245][INFRA] Run Spark tests in Github Actions
### What changes were proposed in this pull request?

This PR aims to run the Spark tests in Github Actions.

To briefly explain the main idea:

- Reuse `dev/run-tests.py` with SBT build
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module
- Pass the modules to test into `dev/run-tests.py` directly via `TEST_ONLY_MODULES` environment variable. For example, `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` _does not_ take the dependent modules into account but solely the specified modules to test.

Another thing to note might be `SlowHiveTest` annotation. Running the tests in Hive modules takes too much so the slow tests are extracted and it runs as a separate job. It was extracted from the actual elapsed time in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So, Hive tests are separated into to jobs. One is slow test cases, and the other one is the other test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

### Why are the changes needed?

Last week and onwards, the Jenkins machines became very unstable for many reasons:
  - Apparently, the machines became extremely slow. Almost all tests can't pass.
  - One machine (worker 4) started to have the corrupt `.m2` which fails the build.
  - Documentation build fails time to time for an unknown reason in Jenkins machine specifically. This is disabled for now at https://github.com/apache/spark/pull/29017.
  - Almost all PRs are basically blocked by this instability currently.

The advantages of using Github Actions:
  - To avoid depending on few persons who can access to the cluster.
  - To reduce the elapsed time in the build - we could split the tests (e.g., SQL, ML, CORE), and run them in parallel so the total build time will significantly reduce.
  - To control the environment more flexibly.
  - Other contributors can test and propose to fix Github Actions configurations so we can distribute this build management cost.

Note that:
- The current build in Jenkins takes _more than 7 hours_. With Github actions it takes _less than 2 hours_
- We can now control the environments especially for Python easily.
- The test and build look more stable than the Jenkins'.

### Does this PR introduce _any_ user-facing change?

No, dev-only change.

### How was this patch tested?

Tested at https://github.com/HyukjinKwon/spark/pull/4

Closes #29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 13:09:06 -07:00
Thomas Graves 95aec091e4 [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests
### What changes were proposed in this pull request?

As part of the Stage level scheduling features, add the Python api's to set resource profiles.
This also adds the functionality to properly apply the pyspark memory configuration when specified in the ResourceProfile. The pyspark memory configuration is being passed in the task local properties. This was an easy way to get it to the PythonRunner that needs it. I modeled this off how the barrier task scheduling is passing the addresses. As part of this I added in the JavaRDD api's because those are needed by python.

### Why are the changes needed?

python api for this feature

### Does this PR introduce any user-facing change?

Yes adds the java and python apis for user to specify a ResourceProfile to use stage level scheduling.

### How was this patch tested?

unit tests and manually tested on yarn. Tests also run to verify it errors properly on standalone and local mode where its not yet supported.

Closes #28085 from tgravescs/SPARK-29641-pr-base.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 10:20:39 +09:00
HyukjinKwon aa6a60530e [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints
### What changes were proposed in this pull request?

This PR targets to document the Pandas UDF redesign with type hints introduced at SPARK-28264.
Mostly self-describing; however, there are few things to note for reviewers.

1. This PR replace the existing documentation of pandas UDFs to the newer redesign to promote the Python type hints. I added some words that Spark 3.0 still keeps the compatibility though.

2. This PR proposes to name non-pandas UDFs as "Pandas Function API"

3. SCALAR_ITER become two separate sections to reduce confusion:
  - `Iterator[pd.Series]` -> `Iterator[pd.Series]`
  - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`

4. I removed some examples that look overkill to me.

5. I also removed some information in the doc, that seems duplicating or too much.

### Why are the changes needed?

To document new redesign in pandas UDF.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover.

Closes #27466 from HyukjinKwon/SPARK-30722.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 10:49:46 +09:00
HyukjinKwon ab0890bdb1 [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
### What changes were proposed in this pull request?

This PR proposes to redesign pandas UDFs as described in [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).

```python
from pyspark.sql.functions import pandas_udf
import pandas as pd

pandas_udf("long")
def plug_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plug_one("id")).show()
```

```
+------------+
|plug_one(id)|
+------------+
|           1|
|           2|
|           3|
|           4|
|           5|
|           6|
|           7|
|           8|
|           9|
|          10|
+------------+
```

Note that, this PR address one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by zero323) together.

In short,

- Adds new way with type hints as an alternative and experimental way.
    ```python
    pandas_udf(schema='...')
    def func(c1: Series, c2: Series) -> DataFrame:
        pass
    ```

- Replace and/or add an alias for three types below from UDF, and make them as separate standalone APIs. So, `pandas_udf` is now consistent with regular `udf`s and other expressions.

    `df.mapInPandas(udf)`  -replace-> `df.mapInPandas(f, schema)`
    `df.groupby.apply(udf)`  -alias-> `df.groupby.applyInPandas(f, schema)`
    `df.groupby.cogroup.apply(udf)`  -replace-> `df.groupby.cogroup.applyInPandas(f, schema)`

    *`df.groupby.apply` was added from 2.3 while the other were added in the master only.

- No deprecation for the existing ways for now.
    ```python
    pandas_udf(schema='...', functionType=PandasUDFType.SCALAR)
    def func(c1, c2):
        pass
    ```
If users are happy with this, I plan to deprecate the existing way and declare using type hints is not experimental anymore.

One design goal in this PR was that, avoid touching the internal (since we didn't deprecate the old ways for now), but supports type hints with a minimised changes only at the interface.

- Once we deprecate or remove the old ways, I think it requires another refactoring for the internal in the future. At the very least, we should rename internal pandas evaluation types.
- If users find this experimental type hints isn't quite helpful, we should simply revert the changes at the interface level.

### Why are the changes needed?

In order to address old design issues. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).

### Does this PR introduce any user-facing change?

For behaviour changes, No.

It adds new ways to use pandas UDFs by using type hints. See below.

**SCALAR**:

```python
pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> Series:
    pass  # DataFrame represents a struct column
```

**SCALAR_ITER**:

```python
pandas_udf(schema='...')
def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]:
    pass  # Same as SCALAR but wrapped by Iterator
```

**GROUPED_AGG**:

```python
pandas_udf(schema='...')
def func(c1: Series, c2: DataFrame) -> int:
    pass  # DataFrame represents a struct column
```

**GROUPED_MAP**:

This was added in Spark 2.3 as of SPARK-20396. As described above, it keeps the existing behaviour. Additionally, we now have a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below:

```python
def func(pdf):
    return pdf

df.groupby("...").applyInPandas(func, schema=df.schema)
```

**MAP_ITER**: this is not a pandas UDF anymore

This was added in Spark 3.0 as of SPARK-28198; and this PR replaces the usages. See the example below:

```python
def func(iter):
    for df in iter:
        yield df

df.mapInPandas(func, df.schema)
```

**COGROUPED_MAP**: this is not a pandas UDF anymore

This was added in Spark 3.0 as of SPARK-27463; and this PR replaces the usages. See the example below:

```python
def asof_join(left, right):
    return pd.merge_asof(left, right, on="...", by="...")

 df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...")
```

### How was this patch tested?

Unittests added and tested against Python 2.7, 3.6 and 3.7.

Closes #27165 from HyukjinKwon/revisit-pandas.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-22 15:32:58 +09:00
HyukjinKwon dcdc9a8be7 [SPARK-28198][PYTHON][FOLLOW-UP] Run the tests of MAP ITER UDF in Jenkins
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/24997 missed to add `pyspark.sql.tests.test_pandas_udf_iter` to `modules.py`. This PR adds it.

### Why are the changes needed?

Currently, Jenkins does not run the test cases. We should run them.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins should test.

Closes #27141 from HyukjinKwon/SPARK-28198-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 13:45:50 +09:00
HyukjinKwon ee8d661058 [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?

This PR proposes to move pandas related functionalities into pandas package. Namely:

```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py  # Conversion between pandas <> PySpark DataFrames
├── functions.py   # pandas_udf
├── group_ops.py   # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py     # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py       # Type utils between pandas <> PyArrow
└── utils.py       # Version requirement checks
```

In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under `pandas` sub-package, I had to use a mix-in approach which Scala side uses often by `trait`, and also pandas itself uses this approach (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think it's like Scala's self typed trait. See the structure below:

```python
class PandasMapOpsMixin(object):
    def mapInPandas(self, ...):
        ...
        return ...

    # other Pandas <> PySpark APIs
```

```python
class DataFrame(PandasMapOpsMixin):

    # other DataFrame APIs equivalent to Scala side.

```

Yes, This is a big PR but they are mostly just moving around except one case `createDataFrame` which I had to split the methods.

### Why are the changes needed?

There are pandas functionalities here and there and I myself gets lost where it was. Also, when you have to make a change commonly for all of pandas related features, it's almost impossible now.

Also, after this change, `DataFrame` and `SparkSession` become more consistent with Scala side since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover. Also, I manually built the PySpark API documentation and checked.

Closes #27109 from HyukjinKwon/pandas-refactoring.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 10:22:50 +09:00
WeichenXu 88542bc3d9 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
### What changes were proposed in this pull request?

PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
df.select(vector_to_array(col("features"))
```

### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in JVM. However, it requires PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure python project.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.

Closes #26910 from WeichenXu123/vector_to_array.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-01-06 16:18:51 -08:00
shane knapp 04e99c1e1b [SPARK-29672][PYSPARK] update spark testing framework to use python3
### What changes were proposed in this pull request?

remove python2.7 tests and test infra for 3.0+

### Why are the changes needed?

because python2.7 is finally going the way of the dodo.

### Does this PR introduce any user-facing change?

newp.

### How was this patch tested?

the build system will test this

Closes #26330 from shaneknapp/remove-py27-tests.

Lead-authored-by: shane knapp <incomplete@gmail.com>
Co-authored-by: shane <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2019-11-14 10:18:55 -08:00
Xianyang Liu 0a7095156b [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier
### What changes were proposed in this pull request?

Add mapPartitionsWithIndex for RDDBarrier.

### Why are the changes needed?

There is only one method in `RDDBarrier`. We often use the partition index as a label for the current partition. We need to get the index from `TaskContext`  index in the method of `mapPartitions` which is not convenient.
### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New UT.

Closes #26148 from ConeyLiu/barrier-index.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-23 13:46:09 +02:00
HyukjinKwon a838dbd2f9 [SPARK-27463][PYTHON][FOLLOW-UP] Run the tests of Cogrouped pandas UDF
### What changes were proposed in this pull request?
This is a followup for https://github.com/apache/spark/pull/24981
Seems we mistakenly didn't added `test_pandas_udf_cogrouped_map` into `modules.py`. So we don't have official test results against that PR.

```
...
Starting test(python3.6): pyspark.sql.tests.test_pandas_udf
...
Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_agg
...
Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_map
...
Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_scalar
...
Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_window
Finished test(python3.6): pyspark.sql.tests.test_pandas_udf (21s)
...
Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_map (49s)
...
Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_window (58s)
...
Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_scalar (82s)
...
Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_agg (105s)
...
```

If tests fail, we should revert that PR.

### Why are the changes needed?

Relevant tests should be ran.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins tests.

Closes #25890 from HyukjinKwon/SPARK-28840.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-22 21:39:30 +09:00
Sean Owen 6378d4bc06 [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc

Notes:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes #25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-09 10:19:40 -05:00
WeichenXu f21bc1874a [SPARK-27889][INFRA] Make development scripts under dev/ support Python 3
## What changes were proposed in this pull request?

I made an audit and update all dev scripts to support python3. (except `merge_spark_pr.py` which already updated)

## How was this patch tested?

Manual.

Closes #25289 from WeichenXu123/dev_py3.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-09 18:55:48 +09:00
Marcelo Vanzin 11e21cc17a [SPARK-28187][BUILD] Add support for hadoop-cloud to the PR builder.
Closes #24987 from vanzin/SPARK-28187.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-06-27 15:59:05 -07:00
Yuming Wang 3f102a8229 [SPARK-27749][SQL] hadoop-3.2 support hive-thriftserver
## What changes were proposed in this pull request?

This PR mainly makes the following changes to make `hadoop-3.2` support `sql/hive-thriftserver`:
1. Upgrade [`TCLIService.thrift`](https://github.com/apache/hive/blob/rel/release-2.3.5/service-rpc/if/TCLIService.thrift) and related code to Hive 2.3.5 because of [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442)(Note that we only migrate code without adding features, such as [HIVE-4924](https://issues.apache.org/jira/browse/HIVE-4924) and [HIVE-15473](https://issues.apache.org/jira/browse/HIVE-15473)).
2. Use slf4j as logging facade because of [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237).
3. Port [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) to compatible with Hive 2.3.

## How was this patch tested?

Exiting test

Closes #24628 from wangyum/SPARK-27749.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-06-05 08:40:05 -07:00
HyukjinKwon b7bf4fd123 [SPARK-27402][INFRA][FOLLOW-UP] Exclude 'hive-thriftserver' in modules to test for hadoop3.2 for now
## What changes were proposed in this pull request?

This PR excludes  'hive-thriftserver' in modules to test for hadoop3.2 for now as well

## How was this patch tested?

Manually tested via `run-tests.py`

Closes #24644 from HyukjinKwon/SPARK-27402.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-20 07:53:19 -07:00
Yuming Wang f3ddd6f9da [SPARK-27402][SQL][TEST-HADOOP3.2][TEST-MAVEN] Fix hadoop-3.2 test issue(except the hive-thriftserver module)
## What changes were proposed in this pull request?

This pr fix hadoop-3.2 test issues(except the `hive-thriftserver` module):
1. Add `hive.metastore.schema.verification` and `datanucleus.schema.autoCreateAll` to HiveConf.
2. hadoop-3.2 support access the Hive metastore from 0.12 to 2.2

After [SPARK-27176](https://issues.apache.org/jira/browse/SPARK-27176) and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues:
- [HIVE-6727](https://issues.apache.org/jira/browse/HIVE-6727): Table level stats for external tables are set incorrectly.
- [HIVE-15653](https://issues.apache.org/jira/browse/HIVE-15653): Some ALTER TABLE commands drop table stats.
- [SPARK-12014](https://issues.apache.org/jira/browse/SPARK-12014): Spark SQL query containing semicolon is broken in Beeline.
- [SPARK-25193](https://issues.apache.org/jira/browse/SPARK-25193): insert overwrite doesn't throw exception when drop old data fails.
- [SPARK-25919](https://issues.apache.org/jira/browse/SPARK-25919): Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
- [SPARK-26332](https://issues.apache.org/jira/browse/SPARK-26332): Spark sql write orc table on viewFS throws exception.
- [SPARK-26437](https://issues.apache.org/jira/browse/SPARK-26437): Decimal data becomes bigint to query, unable to query.

## How was this patch tested?
This pr test Spark’s Hadoop 3.2 profile on jenkins and #24591 test Spark’s Hadoop 2.7 profile on jenkins

This PR close #24591

Closes #24391 from wangyum/SPARK-27402.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-13 10:35:26 -07:00
Gabor Somogyi 3729efb4d0 [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs
## What changes were proposed in this pull request?

Avro is built-in but external data source module since Spark 2.4 but  `from_avro` and `to_avro` APIs not yet supported in pyspark.

In this PR I've made them available from pyspark.

## How was this patch tested?

Please see the python API examples what I've added.

cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
Manual webpage check.

Closes #23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:15:07 +09:00