Commit graph

1034 commits

Author SHA1 Message Date
Dongjoon Hyun de9818f043
[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
CC Highman d338af3101 [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
### What changes were proposed in this pull request?

Two new options, _modifiiedBefore_  and _modifiedAfter_, is provided expecting a value in 'YYYY-MM-DDTHH:mm:ss' format.  _PartioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter.`  In order to filter file results, a new PathFilter class was derived for this purpose.  General house-keeping around classes extending PathFilter was performed for neatness.  It became apparent support was needed to handle multiple potential path filters.  Logic was introduced for this purpose and the associated tests written.

### Why are the changes needed?

When loading files from a data source, there can often times be thousands of file within a respective file path.  In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point.  This would mean out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered.  This saves a ton of time automatically and reduces significant complexity managing this in code.

### Does this PR introduce _any_ user-facing change?

This PR introduces an option that can be used with batch-based Spark file data sources.  A documentation update was made to reflect an example and usage of the new data source option.

**Example Usages**
_Load all CSV files modified after date:_
`spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()`

_Load all CSV files modified before date:_
`spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()`

_Load all CSV files modified between two dates:_
`spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()
`

### How was this patch tested?

A handful of unit tests were added to support the positive, negative, and edge case code paths.

It's also live in a handful of our Databricks dev environments.  (quoted from cchighman)

Closes #30411 from HeartSaVioR/SPARK-31962.

Lead-authored-by: CC Highman <christopher.highman@microsoft.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-23 08:30:41 +09:00
yangjie01 e3058ba17c [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports
### What changes were proposed in this pull request?
This pr add a new Scala compile arg to `pom.xml` to defense against new unused imports:

- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13

The other fIles change are remove all unused imports in Spark code

### Why are the changes needed?
Cleanup code and add guarantee to defense against new unused imports

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30351 from LuciferYang/remove-imports-core-module.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-19 14:20:39 +09:00
Josh Soref 9d58a2f0f0 [MINOR][GRAPHX] Correct typos in the sub-modules: graphx, external, and examples
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules: graphx, external, and examples.
Split per holdenk https://github.com/apache/spark/pull/30323#issuecomment-725159710

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No testing was performed

Closes #30326 from jsoref/spelling-graphx.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-12 08:29:22 +09:00
Keiji Yoshida 7766a6fb5f [MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example
### What changes were proposed in this pull request?
This pull request changes the `sep` parameter's value from `:` to `;` in the example of `examples/src/main/python/sql/datasource.py`. This code snippet is shown on the Spark SQL Guide documentation. The `sep` parameter's value should be `;` since the data in https://github.com/apache/spark/blob/master/examples/src/main/resources/people.csv is separated by `;`.

### Why are the changes needed?
To fix the example code so that it can be executed properly.

### Does this PR introduce _any_ user-facing change?
Yes.
This code snippet is shown on the Spark SQL Guide documentation: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options

### How was this patch tested?
By building the documentation and checking the Spark SQL Guide documentation manually in the local environment.

Closes #30082 from kjmrknsn/fix-example-python-datasource.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-18 16:47:04 +09:00
Yuming Wang 543d59dfbf [SPARK-33107][BUILD][FOLLOW-UP] Remove com.twitter:parquet-hadoop-bundle:1.6.0 and orc.classifier
### What changes were proposed in this pull request?

This pr removes `com.twitter:parquet-hadoop-bundle:1.6.0` and `orc.classifier`.

### Why are the changes needed?

To make code more clear and readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30005 from wangyum/SPARK-33107.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-11 21:54:56 -07:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
Fokko Driesprong a1e459ed9f [SPARK-32719][PYTHON] Add Flake8 check missing imports
https://issues.apache.org/jira/browse/SPARK-32719

### What changes were proposed in this pull request?

Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard).

### Why are the changes needed?

To make sure that the quality of the Python code is up to standard.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit-tests and Flake8 static analysis

Closes #29563 from Fokko/fd-add-check-missing-imports.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:23:31 +09:00
HyukjinKwon c154629171 [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").

Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29548 from HyukjinKwon/SPARK-32183.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:09:06 +09:00
Fokko Driesprong 9fcf0ea718 [SPARK-32319][PYSPARK] Disallow the use of unused imports
Disallow the use of unused imports:

- Unnecessary increases the memory footprint of the application
- Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :)

```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```

After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```

### What changes were proposed in this pull request?

Removing unused imports from the Python files to keep everything nice and tidy.

### Why are the changes needed?

Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adding the rule to the existing Flake8 checks.

Closes #29121 from Fokko/SPARK-32319.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-08 08:51:57 -07:00
Itsuki Toyota 86ead044e3 [SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample cons…
…istently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout.

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout.
However if the job is under cluster mode, the job prints the metrics on the executor's stdout.
It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout.
All of the metrics should output its result on the driver's stdout.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <titsuki@cpan.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-26 09:12:43 -05:00
Sean Owen c28a6fa511 [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
### What changes were proposed in this pull request?

Same as https://github.com/apache/spark/pull/29078 and https://github.com/apache/spark/pull/28971 . This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn` etc) compile under Scala 2.13. It does not close the JIRA, as a result. this also of course does not demonstrate that tests pass yet in 2.13.

Note, this does not fix the `repl` module; that's separate.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29111 from srowen/SPARK-29292.3.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-15 13:26:28 -07:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Huaxin Gao 194ac3be8b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
### What changes were proposed in this pull request?
Add docs and examples for ANOVASelector and FValueSelector

### Why are the changes needed?
Complete the implementation of ANOVASelector and FValueSelector

### Does this PR introduce _any_ user-facing change?
Yes

<img width="850" alt="Screen Shot 2020-05-13 at 5 17 44 PM" src="https://user-images.githubusercontent.com/13592258/81878703-b4f94480-953d-11ea-9166-da3c64852b90.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 05 15 PM" src="https://user-images.githubusercontent.com/13592258/81878600-6055c980-953d-11ea-8b24-09c31647139b.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 06 06 PM" src="https://user-images.githubusercontent.com/13592258/81878603-621f8d00-953d-11ea-9447-39913ccc067d.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 06 21 PM" src="https://user-images.githubusercontent.com/13592258/81878606-65b31400-953d-11ea-9d76-51859266d1a8.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 10 PM" src="https://user-images.githubusercontent.com/13592258/81878611-69df3180-953d-11ea-8618-23a2a6cfd730.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 33 PM" src="https://user-images.githubusercontent.com/13592258/81878620-6cda2200-953d-11ea-9c46-da763328364e.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 47 PM" src="https://user-images.githubusercontent.com/13592258/81878625-6f3c7c00-953d-11ea-9d11-2281b33a0bd8.png">

<img width="851" alt="Screen Shot 2020-05-13 at 5 19 35 PM" src="https://user-images.githubusercontent.com/13592258/81878882-13bebe00-953e-11ea-9776-288bac97d93f.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 08 42 PM" src="https://user-images.githubusercontent.com/13592258/81878637-76638a00-953d-11ea-94b0-dc9bc85ae2b7.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 09 01 PM" src="https://user-images.githubusercontent.com/13592258/81878640-79f71100-953d-11ea-9a66-b27f9482fbd3.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 09 50 PM" src="https://user-images.githubusercontent.com/13592258/81878644-7cf20180-953d-11ea-9142-9658c8e90986.png">

<img width="851" alt="Screen Shot 2020-05-13 at 5 10 06 PM" src="https://user-images.githubusercontent.com/13592258/81878653-81b6b580-953d-11ea-9dc2-8015095cf569.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 10 59 PM" src="https://user-images.githubusercontent.com/13592258/81878658-854a3c80-953d-11ea-8dc9-217aa749fd00.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 11 27 PM" src="https://user-images.githubusercontent.com/13592258/81878659-87ac9680-953d-11ea-8c6b-74ab76748e4a.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 14 54 PM" src="https://user-images.githubusercontent.com/13592258/81878664-8b401d80-953d-11ea-9ee1-05f6677e263c.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 15 17 PM" src="https://user-images.githubusercontent.com/13592258/81878669-8da27780-953d-11ea-8216-77eb8bb7e091.png">

### How was this patch tested?
Manually build and check

Closes #28524 from huaxingao/examples.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-15 09:59:14 -05:00
Huaxin Gao 08335b651a [SPARK-31659][ML][DOCS] Add VarianceThresholdSelector examples and doc
### What changes were proposed in this pull request?
Add VarianceThresholdSelector examples and doc

### Why are the changes needed?
VarianceThresholdSelector is a new feature selector in 3.1.0. We need to add examples and doc

### Does this PR introduce _any_ user-facing change?
Yes.
add Scala, Python and Java examples for VarianceThresholdSelector. Also add doc

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 01 AM" src="https://user-images.githubusercontent.com/13592258/81321791-e3f84d80-9047-11ea-837b-e39c193bd437.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 44 AM" src="https://user-images.githubusercontent.com/13592258/81321806-e8246b00-9047-11ea-8f35-206e330a92ab.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 27 AM" src="https://user-images.githubusercontent.com/13592258/81321822-ea86c500-9047-11ea-8743-99adec7f502b.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 43 AM" src="https://user-images.githubusercontent.com/13592258/81321826-ec508880-9047-11ea-9e7a-22ee5e13f495.png">

### How was this patch tested?
Manually checked

Closes #28478 from huaxingao/variance_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:57:35 +08:00
Qianyang Yu 348fd53214 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue
### What changes were proposed in this pull request?

Add FValue example for ml.stat.FValueTest in python/java/scala

### Why are the changes needed?

Improve ML example

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

manually run the example

Closes #28400 from kevinyu98/spark-26111-fvalue-examples.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-01 09:16:08 -05:00
Huaxin Gao 3bbd80dbc3 [SPARK-31319][SQL][DOCS] Document UDFs/UDAFs in SQL Reference
### What changes were proposed in this pull request?
Document UDF in SQL Reference

### Why are the changes needed?
To make SQL Reference complete.

### Does this PR introduce any user-facing change?
Yes. Here are the new pages:
<img width="1050" alt="Screen Shot 2020-04-09 at 5 06 42 PM" src="https://user-images.githubusercontent.com/13592258/78950977-585dc200-7a85-11ea-875c-ce14c3795e0f.png">

<img width="1049" alt="Screen Shot 2020-04-09 at 5 07 06 PM" src="https://user-images.githubusercontent.com/13592258/78950979-5b58b280-7a85-11ea-81f3-bd5d91bd07e3.png">

<img width="1049" alt="Screen Shot 2020-04-09 at 5 07 26 PM" src="https://user-images.githubusercontent.com/13592258/78950985-5e53a300-7a85-11ea-86be-f63152c1501b.png">

<img width="1051" alt="Screen Shot 2020-04-09 at 5 07 54 PM" src="https://user-images.githubusercontent.com/13592258/78950991-63185700-7a85-11ea-9379-8da46cfc434c.png">

<img width="1060" alt="Screen Shot 2020-04-09 at 5 08 17 PM" src="https://user-images.githubusercontent.com/13592258/78950994-657ab100-7a85-11ea-8b34-d2c87f94b03b.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 09 27 PM" src="https://user-images.githubusercontent.com/13592258/78951001-6875a180-7a85-11ea-874e-8abd14a3d3d3.png">

<img width="1060" alt="Screen Shot 2020-04-09 at 5 10 00 PM" src="https://user-images.githubusercontent.com/13592258/78951005-6f041900-7a85-11ea-9e57-520eb8db59ec.png">

<img width="1049" alt="Screen Shot 2020-04-09 at 5 11 10 PM" src="https://user-images.githubusercontent.com/13592258/78951014-73303680-7a85-11ea-93ab-32d68d2e2d59.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 11 41 PM" src="https://user-images.githubusercontent.com/13592258/78951019-75929080-7a85-11ea-9d3b-600e8e157c05.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 16 22 PM" src="https://user-images.githubusercontent.com/13592258/78951137-dfab3580-7a85-11ea-8512-c6b660aa271e.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 22 15 PM" src="https://user-images.githubusercontent.com/13592258/78951466-22214200-7a87-11ea-93dd-6e36492421f1.png">

<img width="1049" alt="Screen Shot 2020-04-09 at 5 22 46 PM" src="https://user-images.githubusercontent.com/13592258/78951469-24839c00-7a87-11ea-93a9-fe30d689adbd.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 23 08 PM" src="https://user-images.githubusercontent.com/13592258/78951472-26e5f600-7a87-11ea-84db-087a3528aa53.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 23 34 PM" src="https://user-images.githubusercontent.com/13592258/78951474-29e0e680-7a87-11ea-8be4-2a5be1bc3788.png">

<img width="1049" alt="Screen Shot 2020-04-09 at 5 23 57 PM" src="https://user-images.githubusercontent.com/13592258/78951481-2cdbd700-7a87-11ea-8894-0a39abf54a3b.png">

<img width="1050" alt="Screen Shot 2020-04-09 at 5 24 15 PM" src="https://user-images.githubusercontent.com/13592258/78951483-2f3e3100-7a87-11ea-8845-ffebf89d7898.png">

### How was this patch tested?
Manually build and check

Closes #28087 from huaxingao/udf.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-12 23:38:17 -05:00
zero323 697fe911ac [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMRegressor`:

- Supporting ` org.apache.spark.ml.r.FMRegressorWrapper`.
- `FMRegressionModel` S4 class.
- Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27571 from zero323/SPARK-30819.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-09 19:38:11 -05:00
zero323 0063462d55 [SPARK-30818][SPARKR][ML] Add SparkR LinearRegression wrapper
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `LinearRegression`

- Supporting `org.apache.spark.ml.rLinearRegressionWrapper`.
- `LinearRegressionModel` S4 class.
- Corresponding `spark.lm` predict, summary and write.ml generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27593 from zero323/SPARK-30818.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-08 22:29:44 -05:00
zero323 0d37f794ef [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMClassifier`:

- Supporting ` org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27570 from zero323/SPARK-30820.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-07 09:01:45 -05:00
Qianyang Yu e65c21e093 [SPARK-31304][ML][EXAMPLES] Add examples for ml.stat.ANOVATest
### What changes were proposed in this pull request?

Add ANOVATest example for ml.stat.ANOVATest in python/java/scala

### Why are the changes needed?

Improve ML example

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

manually run the example

Closes #28073 from kevinyu98/add-ANOVA-example.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-31 16:33:26 -05:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
HyukjinKwon aa6a60530e [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints
### What changes were proposed in this pull request?

This PR targets to document the Pandas UDF redesign with type hints introduced at SPARK-28264.
Mostly self-describing; however, there are few things to note for reviewers.

1. This PR replace the existing documentation of pandas UDFs to the newer redesign to promote the Python type hints. I added some words that Spark 3.0 still keeps the compatibility though.

2. This PR proposes to name non-pandas UDFs as "Pandas Function API"

3. SCALAR_ITER become two separate sections to reduce confusion:
  - `Iterator[pd.Series]` -> `Iterator[pd.Series]`
  - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`

4. I removed some examples that look overkill to me.

5. I also removed some information in the doc, that seems duplicating or too much.

### Why are the changes needed?

To document new redesign in pandas UDF.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover.

Closes #27466 from HyukjinKwon/SPARK-30722.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 10:49:46 +09:00
yi.wu 5983ad9cc4 [SPARK-30506][SQL][DOC] Document for generic file source options/configs
### What changes were proposed in this pull request?

Add a new document page named *Generic File Source Options* for *Data Sources* menu and added following sub items:

* spark.sql.files.ignoreCorruptFiles
* spark.sql.files.ignoreMissingFiles
* pathGlobFilter
* recursiveFileLookup

And here're snapshots of the generated document:
<img width="1080" alt="doc-1" src="https://user-images.githubusercontent.com/16397174/73816825-87a54800-4824-11ea-97da-e5c40c59a7d4.png">
<img width="1081" alt="doc-2" src="https://user-images.githubusercontent.com/16397174/73816827-8a07a200-4824-11ea-99ec-9c8b0286625e.png">
<img width="1080" alt="doc-3" src="https://user-images.githubusercontent.com/16397174/73816831-8c69fc00-4824-11ea-84f0-6c9e94c2f0e2.png">
<img width="1081" alt="doc-4" src="https://user-images.githubusercontent.com/16397174/73816834-8f64ec80-4824-11ea-9355-76ad45476634.png">

### Why are the changes needed?

Better guidance for end-user.

### Does this PR introduce any user-facing change?

No, added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27302 from Ngone51/doc-generic-file-source-option.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 17:16:38 +08:00
Erik Erlandson 176b69642e [SPARK-30423][SQL] Deprecate UserDefinedAggregateFunction
### What changes were proposed in this pull request?
* Annotate UserDefinedAggregateFunction as deprecated by SPARK-27296
* Update user doc examples to reflect new ability to register typed Aggregator[IN, BUF, OUT] as an untyped aggregating UDF
### Why are the changes needed?
UserDefinedAggregateFunction is being deprecated

### Does this PR introduce any user-facing change?
Changes are to user documentation, and deprecation annotations.

### How was this patch tested?
Testing was via package build to verify doc generation, deprecation warnings, and successful example compilation.

Closes #27193 from erikerlandson/spark-30423.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-14 22:07:13 +08:00
HyukjinKwon ee8d661058 [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?

This PR proposes to move pandas related functionalities into pandas package. Namely:

```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py  # Conversion between pandas <> PySpark DataFrames
├── functions.py   # pandas_udf
├── group_ops.py   # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py     # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py       # Type utils between pandas <> PyArrow
└── utils.py       # Version requirement checks
```

In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under `pandas` sub-package, I had to use a mix-in approach which Scala side uses often by `trait`, and also pandas itself uses this approach (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think it's like Scala's self typed trait. See the structure below:

```python
class PandasMapOpsMixin(object):
    def mapInPandas(self, ...):
        ...
        return ...

    # other Pandas <> PySpark APIs
```

```python
class DataFrame(PandasMapOpsMixin):

    # other DataFrame APIs equivalent to Scala side.

```

Yes, This is a big PR but they are mostly just moving around except one case `createDataFrame` which I had to split the methods.

### Why are the changes needed?

There are pandas functionalities here and there and I myself gets lost where it was. Also, when you have to make a change commonly for all of pandas related features, it's almost impossible now.

Also, after this change, `DataFrame` and `SparkSession` become more consistent with Scala side since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover. Also, I manually built the PySpark API documentation and checked.

Closes #27109 from HyukjinKwon/pandas-refactoring.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 10:22:50 +09:00
zhengruifeng 32a5233d12 [SPARK-30358][ML] ML expose predictRaw and predictProbability
### What changes were proposed in this pull request?
1, expose predictRaw and predictProbability
2, add tests

### Why are the changes needed?
single instance prediction is useful out of spark, specially for online prediction.
Current `predict` is exposed, but it is not enough.

### Does this PR introduce any user-facing change?
Yes, new methods are exposed

### How was this patch tested?
added testsuites

Closes #27015 from zhengruifeng/expose_raw_prob.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-31 12:49:16 +08:00
zhanjf 8d3eed33ee [SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component
### What changes were proposed in this pull request?

Implement Factorization Machines as a ml-pipeline component

1. loss function supports: logloss, mse
2. optimizer: GD, adamW

### Why are the changes needed?

Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:

1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

run unit tests

Closes #27000 from mob-ai/ml/fm.

Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-26 11:39:53 -06:00
Wenchen Fan ba3f6330dd Revert "[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component"
This reverts commit c6ab7165dd.
2019-12-24 14:01:27 +08:00
zhanjf c6ab7165dd [SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component
### What changes were proposed in this pull request?

Implement Factorization Machines as a ml-pipeline component

1. loss function supports: logloss, mse
2. optimizer: GD, adamW

### Why are the changes needed?

Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:

1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

run unit tests

Closes #26124 from mob-ai/ml/fm.

Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-23 10:11:09 -06:00
Yuming Wang 696288f623 [INFRA] Reverts commit 56dcd79 and c216ef1
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79

2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1

### Why are the changes needed?
Shouldn't change master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master

Closes #26915 from wangyum/revert-master.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 19:57:44 -07:00
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Sean Owen 36fa1980c2 [SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 compatibility; remove WrappedArray
### What changes were proposed in this pull request?

Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.

### Why are the changes needed?

These both enable building on Scala 2.13.

### Does this PR introduce any user-facing change?

None

### How was this patch tested?

Existing tests

Closes #26787 from srowen/SPARK-30158.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-09 14:41:48 -06:00
Chris Martin c29494377b [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
This PR adds some extra documentation for the new Cogrouped map Pandas udfs.  Specifically:

- Updated the usage guide for the new `COGROUPED_MAP` Pandas udfs added in https://github.com/apache/spark/pull/24981
- Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by HyukjinKwon in https://github.com/apache/spark/pull/25939

Closes #26110 from d80tb7/SPARK-29126-cogroup-udf-usage-guide.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 10:41:57 +09:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Jiajia Li dc0bc7a6eb [MINOR][DOCS] Fix some typos
### What changes were proposed in this pull request?

This PR proposes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't

Closes #26140 from plusplusjiajia/fix-typos.

Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-17 07:22:01 -07:00
Sean Owen e1ea806b30 [SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13
### What changes were proposed in this pull request?

Scala 2.13 emits a deprecation warning for procedure-like declarations:

```
def foo() {
 ...
```

This is equivalent to the following, so should be changed to avoid a warning:

```
def foo(): Unit = {
  ...
```

### Why are the changes needed?

It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.

Unfortunately, that makes this quite a big change.

### Does this PR introduce any user-facing change?

No behavior change at all.

### How was this patch tested?

Existing tests.

Closes #25968 from srowen/SPARK-29291.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 10:03:23 -07:00
Sean Owen 28b8383a6c [SPARK-29289][BUILD] Update scalatest, scalacheck, scopt, clapper, scala-parser-combinators for 2.13
### What changes were proposed in this pull request?

Update scalatest, scalacheck, scopt, clapper, scala-parser-combinators to latest maintenance release that is also cross-published for Scala 2.13.

### Why are the changes needed?

To build in the future for Scala 2.13

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #25967 from srowen/SPARK-29289.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-30 08:13:57 -05:00
Sean Owen 6378d4bc06 [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc

Notes:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes #25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-09 10:19:40 -05:00
hongdd a838699ee0 [SPARK-28694][EXAMPLES] Add Java/Scala StructuredKerberizedKafkaWordCount examples
### What changes were proposed in this pull request?
Add Java/Scala StructuredKerberizedKafkaWordCount examples to test kerberized kafka.

### Why are the changes needed?
Now,`StructuredKafkaWordCount` example is not support to visit kafka using kerberos authentication.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
```
Yarn client:
   $ bin/run-example --files ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab \
     --driver-java-options "-Djava.security.auth.login.config=${path}/kafka_driver_jaas.conf" \
     --conf \
     "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
     --master yarn
     sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
     subscribe topic1,topic2
Yarn cluster:
$ bin/run-example --files \
    ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab,${krb5_path}/krb5.conf \
    --driver-java-options \
    "-Djava.security.auth.login.config=./kafka_jaas.conf \
    -Djava.security.krb5.conf=./krb5.conf" \
    --conf \
    "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
    --master yarn --deploy-mode cluster \
    sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
    subscribe topic1,topic2
```

Closes #25649 from hddong/Add-StructuredKerberizedKafkaWordCount-examples.

Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-04 13:06:01 +09:00
zhengruifeng 7fe750674e [SPARK-11215][ML][FOLLOWUP] update the examples and suites using new api
## What changes were proposed in this pull request?
since method `labels` is already deprecated, we should update the examples and suites to turn off warings when compiling spark:
```
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala:65: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn]       .setLabels(labelIndexer.labels)
[warn]                               ^
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala:68: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn]       .setLabels(labelIndexer.labels)
[warn]                               ^
```

## How was this patch tested?
existing suites

Closes #25428 from zhengruifeng/del_stringindexer_labels_usage.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-27 08:58:32 -05:00
hongdd c02c86e4e8 [SPARK-28691][EXAMPLES] Add Java/Scala DirectKerberizedKafkaWordCount examples
## What changes were proposed in this pull request?

Now, DirectKafkaWordCount example is not support to visit kafka using kerberos authentication. Add Java/Scala DirectKerberizedKafkaWordCount.

## How was this patch tested?
Use cmd to visit kafka using kerberos authentication.
```
$ bin/run-example --files ${path}/kafka_jaas.conf \
   --driver-java-options "-Djava.security.auth.login.config=${path}/kafka_jaas.conf" \
   --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
   streaming.DirectKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
   consumer-group topic1,topic2
```

Closes #25412 from hddong/example-streaming-support-kafka-kerberos.

Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 21:25:39 +09:00
younggyu chun 8535df7261 [MINOR] Fix typos in comments and replace an explicit type with <>
## What changes were proposed in this pull request?
This PR fixed typos in comments and replace the explicit type with '<>' for Java 8+.

## How was this patch tested?
Manually tested.

Closes #25338 from younggyuchun/younggyu.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-10 16:47:11 -05:00
zhengruifeng 44c28d7515 [SPARK-28399][ML][PYTHON] implement RobustScaler
## What changes were proposed in this pull request?
Implement `RobustScaler`
Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers.

## How was this patch tested?
existing and added tests

Closes #25160 from zhengruifeng/robust_scaler.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-30 10:24:33 -05:00
HyukjinKwon cdbc30213b [SPARK-28226][PYTHON] Document Pandas UDF mapInPandas
## What changes were proposed in this pull request?

This PR proposes to document `MAP_ITER` with `mapInPandas`.

## How was this patch tested?

Manually checked the documentation.

![Screen Shot 2019-07-05 at 1 52 30 PM](https://user-images.githubusercontent.com/6477701/60698812-26cf2d80-9f2c-11e9-8295-9c00c28f5569.png)

![Screen Shot 2019-07-05 at 1 48 53 PM](https://user-images.githubusercontent.com/6477701/60698710-ac061280-9f2b-11e9-8521-a4f361207e06.png)

Closes #25025 from HyukjinKwon/SPARK-28226.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-07 09:07:52 +09:00