Commit graph

2302 commits

Author SHA1 Message Date
Bryan Cutler 65a189c7a1 [SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.15.1. This includes Java artifacts and increases the minimum required version of PyArrow also.

Version 0.12.0 to 0.15.1 includes the following selected fixes/improvements relevant to Spark users:

* ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
* ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
* ARROW-5579 - [Java] shade flatbuffer dependency
* ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
* ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
* ARROW-5893 - [C++] Remove arrow::Column class from C++ library
* ARROW-5970 - [Java] Provide pointer to Arrow buffer
* ARROW-6070 - [Java] Avoid creating new schema before IPC sending
* ARROW-6279 - [Python] Add Table.slice method or allow slices in \_\_getitem\_\_
* ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files.
* ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
* ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
* ARROW-1261 - [Java] Add container type for Map logical type
* ARROW-1207 - [C++] Implement Map logical type

Changelog can be seen at https://arrow.apache.org/release/0.15.0.html

### Why are the changes needed?

Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests, manually tested with Python 3.7, 3.8

Closes #26133 from BryanCutler/arrow-upgrade-015-SPARK-29376.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-15 13:27:30 +09:00
shane knapp 04e99c1e1b [SPARK-29672][PYSPARK] update spark testing framework to use python3
### What changes were proposed in this pull request?

remove python2.7 tests and test infra for 3.0+

### Why are the changes needed?

because python2.7 is finally going the way of the dodo.

### Does this PR introduce any user-facing change?

newp.

### How was this patch tested?

the build system will test this

Closes #26330 from shaneknapp/remove-py27-tests.

Lead-authored-by: shane knapp <incomplete@gmail.com>
Co-authored-by: shane <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2019-11-14 10:18:55 -08:00
Huaxin Gao 1f4075d29e [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
### What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover

### Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.

### Does this PR introduce any user-facing change?
```StopWordsRemover.setInputCols```
```StopWordsRemover.setOutputCols```

### How was this patch tested?
Unit tests

Closes #26480 from huaxingao/spark-29808.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-13 08:18:23 -06:00
zhengruifeng 76e5294bb6 [SPARK-29801][ML] ML models unify toString method
### What changes were proposed in this pull request?
1,ML models should extend toString method to expose basic information.
Current some algs (GBT/RF/LoR) had done this, while others not yet.
2,add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`

### Why are the changes needed?
ML models should extend toString method to expose basic information.

### Does this PR introduce any user-facing change?
yes

### How was this patch tested?
existing testsuites

Closes #26439 from zhengruifeng/models_toString.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-11 11:03:26 -08:00
Bago Amirbekian 8152a87235 [SPARK-28978][ ] Support > 256 args to python udf
### What changes were proposed in this pull request?

On the worker we express lambda functions as strings and then eval them to create a "mapper" function. This make the code hard to read & limits the # of arguments a udf can support to 256 for python <= 3.6.

This PR rewrites the mapper functions as nested functions instead of "lambda strings" and allows passing in more than 255 args.

### Why are the changes needed?
The jira ticket associated with this issue describes how MLflow uses udfs to consume columns as features. This pattern isn't unique and a limit of 255 features is quite low.

### Does this PR introduce any user-facing change?
Users can now pass more than 255 cols to a udf function.

### How was this patch tested?
Added a unit test for passing in > 255 args to udf.

Closes #26442 from MrBago/replace-lambdas-on-worker.

Authored-by: Bago Amirbekian <bago@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-11-08 19:19:14 -08:00
HyukjinKwon 7fc9db0853 [SPARK-29798][PYTHON][SQL] Infers bytes as binary type in createDataFrame in Python 3 at PySpark
### What changes were proposed in this pull request?

This PR proposes to infer bytes as binary types in Python 3. See https://github.com/apache/spark/pull/25749 for discussions. I have also checked that Arrow considers `bytes` as binary type, and PySpark UDF can also accepts `bytes` as a binary type.

Since `bytes` is not a `str` anymore in Python 3, it's clear to call it `BinaryType` in Python 3.

### Why are the changes needed?

To respect Python 3's `bytes` type and support Python's primitive types.

### Does this PR introduce any user-facing change?

Yes.

**Before:**

```python
>>> spark.createDataFrame([[b"abc"]])
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/types.py", line 1036, in _infer_type
    return _infer_schema(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1062, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'bytes'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 445, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/.../spark/python/pyspark/sql/session.py", line 377, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/session.py", line 377, in <genexpr>
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/types.py", line 1064, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 1064, in <listcomp>
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 1038, in _infer_type
    raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'bytes'>
```

**After:**

```python
>>> spark.createDataFrame([[b"abc"]])
DataFrame[_1: binary]
```

### How was this patch tested?
Unittest was added and manually tested.

Closes #26432 from HyukjinKwon/SPARK-29798.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-11-08 12:10:39 -08:00
HyukjinKwon 4ec04e5ef3 [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's
## What changes were proposed in this pull request?

This PR proposes to add **Single threading model design (pinned thread model)** mode which is an experimental mode to sync threads on PVM and JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread

### Multi threading model

Currently, PySpark uses this model. Threads on PVM and JVM are independent. For instance, in a different Python thread, callbacks are received and relevant Python codes are executed. JVM threads are reused when possible.

Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model

One problem in this model is that we can't sync threads on PVM and JVM out of the box. This leads to some problems in particular at some codes related to threading in JVM side. See:
7056e004ee/core/src/main/scala/org/apache/spark/SparkContext.scala (L334)
Due to reusing JVM threads, seems the job groups in Python threads cannot be set in each thread as described in the JIRA.

### Single threading model design (pinned thread model)

This mode pins and syncs the threads on PVM and JVM to work around the problem above. For instance, in the same Python thread, callbacks are received and relevant Python codes are executed. See https://www.py4j.org/advanced_topics.html#the-single-threading-model

Even though this mode can sync threads on PVM and JVM for other thread related code paths,
 this might cause another problem: seems unable to inherit properties as below (assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; however, it can be always seen in single threading mode):

```bash
$ PYSPARK_PIN_THREAD=true ./bin/pyspark
```

```python
import threading

spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))

threading.Thread(target=print_prop).start()
```

```
None
```

Unlike Scala side:

```scala
spark.sparkContext.setLocalProperty("a", "hi")
new Thread(new Runnable {
  def run() = println(spark.sparkContext.getLocalProperty("a"))
}).start()
```

```
hi
```

This behaviour potentially could cause weird issues but this PR currently does not target this fix this for now since this mode is experimental.

### How does this PR fix?

Basically there are two types of Py4J servers `GatewayServer` and `ClientServer`.  The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter.

In Scala side:
The logic to select a server is encapsulated in `Py4JServer` and use `Py4JServer` at `PythonRunner` for Spark summit and `PythonGatewayServer` for Spark shell. Each uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.

In Python side:
Simply do an if-else to switch the server to talk. It uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.

This is disabled by default for now.

## How was this patch tested?

Manually tested. This can be tested via:

```python
PYSPARK_PIN_THREAD=true ./bin/pyspark
```

and/or

```bash
cd python
./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread"
```

Also, ran the Jenkins tests with `PYSPARK_PIN_THREAD` enabled.

Closes #24898 from HyukjinKwon/pinned-thread.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-08 06:44:58 +09:00
zhengruifeng ed12b61784 [SPARK-29656][ML][PYSPARK] ML algs expose aggregationDepth
### What changes were proposed in this pull request?
expose expert param `aggregationDepth` in algs: GMM/GLR

### Why are the changes needed?
SVC/LoR/LiR/AFT had exposed expert param aggregationDepth to end users. It should be nice to expose it in similar algs.

### Does this PR introduce any user-facing change?
yes, expose new param

### How was this patch tested?
added pytext tests

Closes #26322 from zhengruifeng/agg_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-11-06 10:34:53 +08:00
Dongjoon Hyun c55265cd2d [SPARK-29739][PYSPARK][TESTS] Use java instead of cc in test_pipe_functions
### What changes were proposed in this pull request?

This PR aims to replace `cc` with `java` in `test_pipe_functions` of `test_rdd.py`.

### Why are the changes needed?

Currently, `test_rdd.py` assumes `cc` installation during `rdd.pipe` tests.
This requires us to install `gcc` for python testing. If we use `java`, we can have the same test coverage and we don't need to install it because it's already installed in `PySpark` test environment.

This will be helpful when we build a dockerized parallel testing environment.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the existing PySpark tests.

Closes #26383 from dongjoon-hyun/SPARK-29739.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 23:03:38 -08:00
Matt Stillwell 1e1b7302f4 [MINOR][PYSPARK][DOCS] Fix typo in example documentation
### What changes were proposed in this pull request?

I propose that we change the example code documentation to call the proper function .
For example, under the `foreachBatch` function, the example code was calling the `foreach()` function by mistake.

### Why are the changes needed?

I suppose it could confuse some people, and it is a typo

### Does this PR introduce any user-facing change?

No, there is no "meaningful" code being change, simply the documentation

### How was this patch tested?

I made the change on a fork and it still worked

Closes #26299 from mstill3/patch-1.

Authored-by: Matt Stillwell <18670089+mstill3@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-01 11:55:29 -07:00
Terry Kim 3175f4bf1b [SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala
### What changes were proposed in this pull request?

This PR changes the behavior of `Column.getItem` to call `Column.getItem` on Scala side instead of `Column.apply`.

### Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:
```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```
In Scala:
```Scala
val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//  at org.apache.spark.sql.Column.getItem(Column.scala:856)
//  ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show
```

### Does this PR introduce any user-facing change?

Yes. If the use wants to pass `Column` object to `getItem`, he/she now needs to use the indexing operator to achieve the previous behavior.

```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id'))].show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```

### How was this patch tested?

Existing tests.

Closes #26351 from imback82/spark-29664.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-01 12:25:48 +09:00
zhengruifeng bb478706b5 [SPARK-29645][ML][PYSPARK] ML add param RelativeError
### What changes were proposed in this pull request?
1, add shared param `relativeError`
2, `Imputer`/`RobusterScaler`/`QuantileDiscretizer` extend `HasRelativeError`

### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` had already added this param, while other algs not yet.

### Does this PR introduce any user-facing change?
yes, new param is added in  `Imputer`/`RobusterScaler`

### How was this patch tested?
existing testsutes

Closes #26305 from zhengruifeng/add_relative_err.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-31 13:52:28 +08:00
Xianyang Liu 1e599e5005 [SPARK-29582][PYSPARK] Support TaskContext.get() in a barrier task from Python side
### What changes were proposed in this pull request?

Add support of `TaskContext.get()` in a barrier task from Python side, this makes it easier to migrate legacy user code to barrier execution mode.

### Why are the changes needed?

In Spark Core, there is a `TaskContext` object which is a singleton. We set a task context instance which can be TaskContext or BarrierTaskContext before the task function startup, and unset it to none after the function end. So we can both get TaskContext and BarrierTaskContext with the object. However we can only get the BarrierTaskContext with `BarrierTaskContext`, we will get `None` if we get it by `TaskContext.get` in a barrier stage.

This is useful when people switch from normal code to barrier code, and only need a little update.

### Does this PR introduce any user-facing change?

Yes.
Previously:
```python
def func(iterator):
    task_context = TaskContext.get() . # this could be None.
    barrier_task_context = BarrierTaskContext.get() # get the BarrierTaskContext instance
    ...

rdd.barrier().mapPartitions(func)
```

Proposed:
```python
def func(iterator):
    task_context = TaskContext.get() . # this could also get the BarrierTaskContext instance which is same as barrier_task_context
    barrier_task_context = BarrierTaskContext.get() # get the BarrierTaskContext instance
    ...

rdd.barrier().mapPartitions(func)
```

### How was this patch tested?

New UT tests.

Closes #26239 from ConeyLiu/barrier_task_context.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 13:10:44 +09:00
HyukjinKwon aa3716896f [SPARK-29668][PYTHON] Add a deprecation warning for Python 3.4 and 3.5
### What changes were proposed in this pull request?

This PR proposes to show a warning for deprecated Python 3.4 and 3.5 in Pyspark.

### Why are the changes needed?

It's officially deprecated.

### Does this PR introduce any user-facing change?

Yes, it shows a warning message for Python 3.4 and 3.5:

```
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/.../spark/python/pyspark/context.py:220: DeprecationWarning: Support for Python 2 and Python 3 prior to version 3.6 is deprecated as of Spark 3.0. See also the plan for dropping Python 2 support at https://spark.apache.org/news/plan-for-dropping-python-2-support.html.
  DeprecationWarning)
...
```

### How was this patch tested?

Manually tested.

Closes #26335 from HyukjinKwon/SPARK-29668.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 20:36:45 -07:00
Chris Martin c29494377b [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
This PR adds some extra documentation for the new Cogrouped map Pandas udfs.  Specifically:

- Updated the usage guide for the new `COGROUPED_MAP` Pandas udfs added in https://github.com/apache/spark/pull/24981
- Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by HyukjinKwon in https://github.com/apache/spark/pull/25939

Closes #26110 from d80tb7/SPARK-29126-cogroup-udf-usage-guide.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 10:41:57 +09:00
HyukjinKwon 8682bb11ae [SPARK-29627][PYTHON][SQL] Allow array_contains to take column instances
### What changes were proposed in this pull request?

This PR proposes to allow `array_contains` to take column instances.

### Why are the changes needed?

For consistent support in Scala and Python APIs. Scala allows column instances at `array_contains`

Scala:

```scala
import org.apache.spark.sql.functions._
val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data")
df.select(array_contains($"data", lit("a"))).show()
```

Python:

```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```

However, PySpark sides does not allow.

### Does this PR introduce any user-facing change?

Yes.

```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```

**Before:**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```

**After:**

```
+-----------------------+
|array_contains(data, a)|
+-----------------------+
|                   true|
|                  false|
+-----------------------+
```

### How was this patch tested?

Manually tested and added a doctest.

Closes #26288 from HyukjinKwon/SPARK-29627.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 09:45:19 +09:00
Huaxin Gao 37690dea10 [SPARK-29565][ML][PYTHON] OneHotEncoder should support single-column input/output
### What changes were proposed in this pull request?
add single-column input/ouput support in OneHotEncoder

### Why are the changes needed?
Currently, OneHotEncoder only has multi columns support.  It makes sense to support single column as well.

### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```

### How was this patch tested?
Unit test

Closes #26265 from huaxingao/spark-29565.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-28 23:20:21 -07:00
Huaxin Gao c137acbf65 [SPARK-29566][ML] Imputer should support single-column input/output
### What changes were proposed in this pull request?
add single-column input/output support in Imputer

### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.

### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```

### How was this patch tested?
add unit tests

Closes #26247 from huaxingao/spark-29566.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-29 11:11:41 +08:00
Huaxin Gao b19fd487df [SPARK-29093][PYTHON][ML] Remove automatically generated param setters in _shared_params_code_gen.py
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py

### Why are the changes needed?
To keep parity between scala and python

### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels

### How was this patch tested?
unit tests

Closes #26232 from huaxingao/spark-29093.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-28 11:36:10 +08:00
stczwd dcf5eaf1a6 [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating
# What changes were proposed in this pull request?
Add description for ignoreNullFields, which is commited in #26098 , in DataFrameWriter and readwriter.py.
Enable user to use ignoreNullFields in pyspark.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run unit tests

Closes #26227 from stczwd/json-generator-doc.

Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 10:25:04 -07:00
Xianyang Liu 0a7095156b [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier
### What changes were proposed in this pull request?

Add mapPartitionsWithIndex for RDDBarrier.

### Why are the changes needed?

There is only one method in `RDDBarrier`. We often use the partition index as a label for the current partition. We need to get the index from `TaskContext`  index in the method of `mapPartitions` which is not convenient.
### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New UT.

Closes #26148 from ConeyLiu/barrier-index.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-23 13:46:09 +02:00
HyukjinKwon 811d563fbf [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8
### What changes were proposed in this pull request?

Inline cloudpickle in PySpark to cloudpickle 1.1.1. See https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py

https://github.com/cloudpipe/cloudpickle/pull/269 was added for Python 3.8 support (fixed from 1.1.0). Using 1.2.2 seems breaking PyPy 2 due to cloudpipe/cloudpickle#278 so this PR currently uses 1.1.1.

Once we drop Python 2, we can switch to the highest version.

### Why are the changes needed?

positional-only arguments was newly introduced from Python 3.8 (see https://docs.python.org/3/whatsnew/3.8.html#positional-only-parameters)

Particularly the newly added argument to `types.CodeType` was the problem (https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-python-api):

> `types.CodeType` has a new parameter in the second position of the constructor (posonlyargcount) to support positional-only arguments defined in **PEP 570**. The first argument (argcount) now represents the total number of positional arguments (including positional-only arguments). The new `replace()` method of `types.CodeType` can be used to make the code future-proof.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested. Note that the optional dependency PyArrow looks not yet supporting Python 3.8; therefore, it was not tested. See "Details" below.

<details>
<p>

```bash
cd python
./run-tests --python-executables=python3.8
```

```
Running PySpark tests. Output is in /Users/hyukjin.kwon/workspace/forked/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.8']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Starting test(python3.8): pyspark.ml.tests.test_algorithms
Starting test(python3.8): pyspark.ml.tests.test_feature
Starting test(python3.8): pyspark.ml.tests.test_base
Starting test(python3.8): pyspark.ml.tests.test_evaluation
Finished test(python3.8): pyspark.ml.tests.test_base (12s)
Starting test(python3.8): pyspark.ml.tests.test_image
Finished test(python3.8): pyspark.ml.tests.test_evaluation (14s)
Starting test(python3.8): pyspark.ml.tests.test_linalg
Finished test(python3.8): pyspark.ml.tests.test_feature (23s)
Starting test(python3.8): pyspark.ml.tests.test_param
Finished test(python3.8): pyspark.ml.tests.test_image (22s)
Starting test(python3.8): pyspark.ml.tests.test_persistence
Finished test(python3.8): pyspark.ml.tests.test_param (25s)
Starting test(python3.8): pyspark.ml.tests.test_pipeline
Finished test(python3.8): pyspark.ml.tests.test_linalg (37s)
Starting test(python3.8): pyspark.ml.tests.test_stat
Finished test(python3.8): pyspark.ml.tests.test_pipeline (7s)
Starting test(python3.8): pyspark.ml.tests.test_training_summary
Finished test(python3.8): pyspark.ml.tests.test_stat (21s)
Starting test(python3.8): pyspark.ml.tests.test_tuning
Finished test(python3.8): pyspark.ml.tests.test_persistence (45s)
Starting test(python3.8): pyspark.ml.tests.test_wrapper
Finished test(python3.8): pyspark.ml.tests.test_algorithms (83s)
Starting test(python3.8): pyspark.mllib.tests.test_algorithms
Finished test(python3.8): pyspark.ml.tests.test_training_summary (32s)
Starting test(python3.8): pyspark.mllib.tests.test_feature
Finished test(python3.8): pyspark.ml.tests.test_wrapper (20s)
Starting test(python3.8): pyspark.mllib.tests.test_linalg
Finished test(python3.8): pyspark.mllib.tests.test_feature (32s)
Starting test(python3.8): pyspark.mllib.tests.test_stat
Finished test(python3.8): pyspark.mllib.tests.test_algorithms (70s)
Starting test(python3.8): pyspark.mllib.tests.test_streaming_algorithms
Finished test(python3.8): pyspark.mllib.tests.test_stat (37s)
Starting test(python3.8): pyspark.mllib.tests.test_util
Finished test(python3.8): pyspark.mllib.tests.test_linalg (70s)
Starting test(python3.8): pyspark.sql.tests.test_arrow
Finished test(python3.8): pyspark.sql.tests.test_arrow (1s) ... 53 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_catalog
Finished test(python3.8): pyspark.mllib.tests.test_util (15s)
Starting test(python3.8): pyspark.sql.tests.test_column
Finished test(python3.8): pyspark.sql.tests.test_catalog (24s)
Starting test(python3.8): pyspark.sql.tests.test_conf
Finished test(python3.8): pyspark.sql.tests.test_column (21s)
Starting test(python3.8): pyspark.sql.tests.test_context
Finished test(python3.8): pyspark.ml.tests.test_tuning (125s)
Starting test(python3.8): pyspark.sql.tests.test_dataframe
Finished test(python3.8): pyspark.sql.tests.test_conf (9s)
Starting test(python3.8): pyspark.sql.tests.test_datasources
Finished test(python3.8): pyspark.sql.tests.test_context (29s)
Starting test(python3.8): pyspark.sql.tests.test_functions
Finished test(python3.8): pyspark.sql.tests.test_datasources (32s)
Starting test(python3.8): pyspark.sql.tests.test_group
Finished test(python3.8): pyspark.sql.tests.test_dataframe (39s) ... 3 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf (1s) ... 6 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map (0s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg (1s) ... 15 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map (1s) ... 20 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar (1s) ... 49 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_window
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_window (1s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_readwriter
Finished test(python3.8): pyspark.sql.tests.test_functions (29s)
Starting test(python3.8): pyspark.sql.tests.test_serde
Finished test(python3.8): pyspark.sql.tests.test_group (20s)
Starting test(python3.8): pyspark.sql.tests.test_session
Finished test(python3.8): pyspark.mllib.tests.test_streaming_algorithms (126s)
Starting test(python3.8): pyspark.sql.tests.test_streaming
Finished test(python3.8): pyspark.sql.tests.test_serde (25s)
Starting test(python3.8): pyspark.sql.tests.test_types
Finished test(python3.8): pyspark.sql.tests.test_readwriter (38s)
Starting test(python3.8): pyspark.sql.tests.test_udf
Finished test(python3.8): pyspark.sql.tests.test_session (32s)
Starting test(python3.8): pyspark.sql.tests.test_utils
Finished test(python3.8): pyspark.sql.tests.test_utils (17s)
Starting test(python3.8): pyspark.streaming.tests.test_context
Finished test(python3.8): pyspark.sql.tests.test_types (45s)
Starting test(python3.8): pyspark.streaming.tests.test_dstream
Finished test(python3.8): pyspark.sql.tests.test_udf (44s)
Starting test(python3.8): pyspark.streaming.tests.test_kinesis
Finished test(python3.8): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
Starting test(python3.8): pyspark.streaming.tests.test_listener
Finished test(python3.8): pyspark.streaming.tests.test_context (28s)
Starting test(python3.8): pyspark.tests.test_appsubmit
Finished test(python3.8): pyspark.sql.tests.test_streaming (60s)
Starting test(python3.8): pyspark.tests.test_broadcast
Finished test(python3.8): pyspark.streaming.tests.test_listener (11s)
Starting test(python3.8): pyspark.tests.test_conf
Finished test(python3.8): pyspark.tests.test_conf (17s)
Starting test(python3.8): pyspark.tests.test_context
Finished test(python3.8): pyspark.tests.test_broadcast (39s)
Starting test(python3.8): pyspark.tests.test_daemon
Finished test(python3.8): pyspark.tests.test_daemon (5s)
Starting test(python3.8): pyspark.tests.test_join
Finished test(python3.8): pyspark.tests.test_context (31s)
Starting test(python3.8): pyspark.tests.test_profiler
Finished test(python3.8): pyspark.tests.test_join (9s)
Starting test(python3.8): pyspark.tests.test_rdd
Finished test(python3.8): pyspark.tests.test_profiler (12s)
Starting test(python3.8): pyspark.tests.test_readwrite
Finished test(python3.8): pyspark.tests.test_readwrite (23s) ... 3 tests were skipped
Starting test(python3.8): pyspark.tests.test_serializers
Finished test(python3.8): pyspark.tests.test_appsubmit (94s)
Starting test(python3.8): pyspark.tests.test_shuffle
Finished test(python3.8): pyspark.streaming.tests.test_dstream (110s)
Starting test(python3.8): pyspark.tests.test_taskcontext
Finished test(python3.8): pyspark.tests.test_rdd (42s)
Starting test(python3.8): pyspark.tests.test_util
Finished test(python3.8): pyspark.tests.test_serializers (11s)
Starting test(python3.8): pyspark.tests.test_worker
Finished test(python3.8): pyspark.tests.test_shuffle (12s)
Starting test(python3.8): pyspark.accumulators
Finished test(python3.8): pyspark.tests.test_util (7s)
Starting test(python3.8): pyspark.broadcast
Finished test(python3.8): pyspark.accumulators (8s)
Starting test(python3.8): pyspark.conf
Finished test(python3.8): pyspark.broadcast (8s)
Starting test(python3.8): pyspark.context
Finished test(python3.8): pyspark.tests.test_worker (19s)
Starting test(python3.8): pyspark.ml.classification
Finished test(python3.8): pyspark.conf (4s)
Starting test(python3.8): pyspark.ml.clustering
Finished test(python3.8): pyspark.context (22s)
Starting test(python3.8): pyspark.ml.evaluation
Finished test(python3.8): pyspark.tests.test_taskcontext (49s)
Starting test(python3.8): pyspark.ml.feature
Finished test(python3.8): pyspark.ml.clustering (43s)
Starting test(python3.8): pyspark.ml.fpm
Finished test(python3.8): pyspark.ml.evaluation (27s)
Starting test(python3.8): pyspark.ml.image
Finished test(python3.8): pyspark.ml.image (8s)
Starting test(python3.8): pyspark.ml.linalg.__init__
Finished test(python3.8): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.8): pyspark.ml.recommendation
Finished test(python3.8): pyspark.ml.classification (63s)
Starting test(python3.8): pyspark.ml.regression
Finished test(python3.8): pyspark.ml.fpm (23s)
Starting test(python3.8): pyspark.ml.stat
Finished test(python3.8): pyspark.ml.stat (30s)
Starting test(python3.8): pyspark.ml.tuning
Finished test(python3.8): pyspark.ml.regression (51s)
Starting test(python3.8): pyspark.mllib.classification
Finished test(python3.8): pyspark.ml.feature (93s)
Starting test(python3.8): pyspark.mllib.clustering
Finished test(python3.8): pyspark.ml.tuning (39s)
Starting test(python3.8): pyspark.mllib.evaluation
Finished test(python3.8): pyspark.mllib.classification (38s)
Starting test(python3.8): pyspark.mllib.feature
Finished test(python3.8): pyspark.mllib.evaluation (25s)
Starting test(python3.8): pyspark.mllib.fpm
Finished test(python3.8): pyspark.mllib.clustering (64s)
Starting test(python3.8): pyspark.mllib.linalg.__init__
Finished test(python3.8): pyspark.ml.recommendation (131s)
Starting test(python3.8): pyspark.mllib.linalg.distributed
Finished test(python3.8): pyspark.mllib.linalg.__init__ (0s)
Starting test(python3.8): pyspark.mllib.random
Finished test(python3.8): pyspark.mllib.feature (36s)
Starting test(python3.8): pyspark.mllib.recommendation
Finished test(python3.8): pyspark.mllib.fpm (31s)
Starting test(python3.8): pyspark.mllib.regression
Finished test(python3.8): pyspark.mllib.random (16s)
Starting test(python3.8): pyspark.mllib.stat.KernelDensity
Finished test(python3.8): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.8): pyspark.mllib.stat._statistics
Finished test(python3.8): pyspark.mllib.stat._statistics (25s)
Starting test(python3.8): pyspark.mllib.tree
Finished test(python3.8): pyspark.mllib.regression (44s)
Starting test(python3.8): pyspark.mllib.util
Finished test(python3.8): pyspark.mllib.recommendation (49s)
Starting test(python3.8): pyspark.profiler
Finished test(python3.8): pyspark.mllib.linalg.distributed (53s)
Starting test(python3.8): pyspark.rdd
Finished test(python3.8): pyspark.profiler (14s)
Starting test(python3.8): pyspark.serializers
Finished test(python3.8): pyspark.mllib.tree (30s)
Starting test(python3.8): pyspark.shuffle
Finished test(python3.8): pyspark.shuffle (2s)
Starting test(python3.8): pyspark.sql.avro.functions
Finished test(python3.8): pyspark.mllib.util (30s)
Starting test(python3.8): pyspark.sql.catalog
Finished test(python3.8): pyspark.serializers (17s)
Starting test(python3.8): pyspark.sql.column
Finished test(python3.8): pyspark.rdd (31s)
Starting test(python3.8): pyspark.sql.conf
Finished test(python3.8): pyspark.sql.conf (7s)
Starting test(python3.8): pyspark.sql.context
Finished test(python3.8): pyspark.sql.avro.functions (19s)
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.catalog (16s)
Starting test(python3.8): pyspark.sql.functions
Finished test(python3.8): pyspark.sql.column (27s)
Starting test(python3.8): pyspark.sql.group
Finished test(python3.8): pyspark.sql.context (26s)
Starting test(python3.8): pyspark.sql.readwriter
Finished test(python3.8): pyspark.sql.group (52s)
Starting test(python3.8): pyspark.sql.session
Finished test(python3.8): pyspark.sql.dataframe (73s)
Starting test(python3.8): pyspark.sql.streaming
Finished test(python3.8): pyspark.sql.functions (75s)
Starting test(python3.8): pyspark.sql.types
Finished test(python3.8): pyspark.sql.readwriter (57s)
Starting test(python3.8): pyspark.sql.udf
Finished test(python3.8): pyspark.sql.types (13s)
Starting test(python3.8): pyspark.sql.window
Finished test(python3.8): pyspark.sql.session (32s)
Starting test(python3.8): pyspark.streaming.util
Finished test(python3.8): pyspark.streaming.util (1s)
Starting test(python3.8): pyspark.util
Finished test(python3.8): pyspark.util (0s)
Finished test(python3.8): pyspark.sql.streaming (30s)
Finished test(python3.8): pyspark.sql.udf (27s)
Finished test(python3.8): pyspark.sql.window (22s)
Tests passed in 855 seconds
```
</p>
</details>

Closes #26194 from HyukjinKwon/SPARK-29536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 16:18:34 +09:00
shahid 4a6005c795 [SPARK-29235][ML][PYSPARK] Support avgMetrics in read/write of CrossValidatorModel
### What changes were proposed in this pull request?
 Currently pyspark doesn't write/read `avgMetrics` in `CrossValidatorModel`, whereas scala supports it.

### Why are the changes needed?
 Test step to reproduce it:

```
dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
      (Vectors.dense([0.6]), 1.0),
      (Vectors.dense([1.0]), 1.0)] * 10,
     ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,parallelism=2)
cvModel = cv.fit(dataset)
cvModel.write().save("/tmp/model")
cvModel2 = CrossValidatorModel.read().load("/tmp/model")
print(cvModel.avgMetrics) # prints non empty result as expected
print(cvModel2.avgMetrics) # Bug: prints an empty result.
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested

Before patch:
```
>>> cvModel.write().save("/tmp/model_0")
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0")
>>> print(cvModel2.avgMetrics)
[]
```

After patch:
```
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2")
>>> print(cvModel2.avgMetrics[0])
0.5
```

Closes #26038 from shahidki31/avgMetrics.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-19 15:23:57 -05:00
zhengruifeng dba673f0e3 [SPARK-29489][ML][PYSPARK] ml.evaluation support log-loss
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss

### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice

### Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param `eps`

### How was this patch tested?
added testsuites & local tests refering to sklearn

Closes #26135 from zhengruifeng/logloss.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-18 17:57:13 +08:00
Huaxin Gao 6f8c001c8d [SPARK-29381][FOLLOWUP][PYTHON][ML] Add 'private' _XXXParams classes for classification & regression
### What changes were proposed in this pull request?
Add private _XXXParams classes for classification & regression

### Why are the changes needed?
To keep parity between scala and python

### Does this PR introduce any user-facing change?
Yes. Add gettters/setters for the following Model classes

```
LinearSVCModel:
get/setRegParam
get/setMaxIte
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold

LogisticRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setFitIntercept
get/setTol
get/setStandardization
get/setWeightCol
get/setAggregationDepth
get/setThreshold

NaiveBayesModel:
get/setWeightCol

LinearRegressionModel:
get/setRegParam
get/setElasticNetParam
get/setMaxIter
get/setTol
get/setFitIntercept
get/setStandardization
get/setWeight
get/setSolver
get/setAggregationDepth
get/setLoss

GeneralizedLinearRegressionModel:
get/setFitIntercept
get/setMaxIter
get/setTol
get/setRegParam
get/setWeightCol
get/setSolver
```

### How was this patch tested?
Add a few doctest

Closes #26142 from huaxingao/spark-29381.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-18 17:26:54 +08:00
Huaxin Gao 901ff92969 [SPARK-29464][PYTHON][ML] PySpark ML should expose Params.clear() to unset a user supplied Param
### What changes were proposed in this pull request?
change PySpark ml ```Params._clear``` to ```Params.clear```

### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param.

### Does this PR introduce any user-facing change?
Yes. PySpark ml ```Params._clear``` ---> ```Params.clear```

### How was this patch tested?
Add test.

Closes #26130 from huaxingao/spark-29464.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-10-17 17:02:31 -07:00
zhengruifeng 9dacdd38b3 [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
### What changes were proposed in this pull request?
Binarizer support multi-column by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`

### Why are the changes needed?
similar algs in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`

### Does this PR introduce any user-facing change?
yes, add setter/getter of `thresholds`/`inputCols`/`outputCols`

### How was this patch tested?
added suites

Closes #26064 from zhengruifeng/binarizer_multicols.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-16 18:32:07 +08:00
Jeff Evans 95de93b24e [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

### What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.

### Why are the changes needed?

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters.  Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

### Does this PR introduce any user-facing change?

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception.  Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.

### How was this patch tested?

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-15 15:44:51 -05:00
Huaxin Gao cfcaf528cd [SPARK-29381][PYTHON][ML] Add _ before the XXXParams classes
### What changes were proposed in this pull request?
Add _ before XXXParams classes to indicate internal usage

### Why are the changes needed?
Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
use existing tests

Closes #26103 from huaxingao/spark-29381.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-14 10:52:23 -05:00
Huaxin Gao 67e1360bad [SPARK-29377][PYTHON][ML] Parity between Scala ML tuning and Python ML tuning
### What changes were proposed in this pull request?
Follow Scala ml tuning implementation

- put leading underscore before python ```ValidatorParams``` to indicate private
- add ```_CrossValidatorParams``` and ```_TrainValidationSplitParams```
- separate the getters and setters. Put getters in _XXXParams and setters in the Classes.

### Why are the changes needed?
Keep parity between scala and python

### Does this PR introduce any user-facing change?
add ```CrossValidatorModel.getNumFolds``` and ```TrainValidationSplitModel.getTrainRatio()```

### How was this patch tested?
Add doctest

Closes #26057 from huaxingao/spark-tuning.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-14 14:28:31 +08:00
Huaxin Gao 81362956a7 [SPARK-29116][PYTHON][ML] Refactor py classes related to DecisionTree
### What changes were proposed in this pull request?

- Move tree related classes to a separate file ```tree.py```
- add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```

### Why are the changes needed?

- keep parity between scala and python
- easy code maintenance

### Does this PR introduce any user-facing change?
Yes
add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
add ```setMinWeightFractionPerNode```  in ```DecisionTreeClassifier``` and ```DecisionTreeRegressor```

### How was this patch tested?
add some doc tests

Closes #25929 from huaxingao/spark_29116.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-12 22:13:50 +08:00
Bryan Cutler beb8d2f8ad [SPARK-29402][PYTHON][TESTS] Added tests for grouped map pandas_udf with window
### What changes were proposed in this pull request?

Added tests for grouped map pandas_udf using a window.

### Why are the changes needed?

Current tests for grouped map do not use a window and this had previously caused an error due the window range being a struct column, which was not yet supported.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New tests added.

Closes #26063 from BryanCutler/pyspark-pandas_udf-group-with-window-tests-SPARK-29402.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-10-11 16:19:13 -07:00
Huaxin Gao ffddfc8584 [SPARK-29269][PYTHON][ML] Pyspark ALSModel support getters/setters
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.

### Why are the changes needed?
To keep parity between python and scala.

### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```

### How was this patch tested?
add doctest

Closes #25947 from huaxingao/spark-29269.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-08 14:05:09 +08:00
Huaxin Gao 2399134456 [SPARK-29143][PYTHON][ML] Pyspark feature models support column setters/getters
### What changes were proposed in this pull request?
add column setters/getters support in Pyspark feature models

### Why are the changes needed?
keep parity between Pyspark and Scala

### Does this PR introduce any user-facing change?
Yes.
After the change, Pyspark feature models have column setters/getters support.

### How was this patch tested?
Add some doctests

Closes #25908 from huaxingao/spark-29143.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 10:55:48 -05:00
Huaxin Gao bd213a0850 [SPARK-29360][PYTHON][ML] PySpark FPGrowthModel supports getter/setter
### What changes were proposed in this pull request?

### Why are the changes needed?
Keep parity between Scala and Python

### Does this PR introduce any user-facing change?
add the following getters/setter to FPGrowthModel
```
getMinSupport
getNumPartitions
getMinConfidence
getItemsCol
getPredictionCol
setItemsCol
setMinConfidence
setPredictionCol
```
add following getters/setters to PrefixSpan
```
set/getMinSupport
set/getMaxPatternLength
set/getMaxLocalProjDBSize
set/getSequenceCol
```

### How was this patch tested?
add doctest

Closes #26035 from huaxingao/spark-29360.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 10:53:59 -05:00
zero323 8556710409 [SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _IsotonicRegressionBase
### What changes were proposed in this pull request?

Adds

```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```

with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel,` in `IsotonicRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with newly added `JavaPredictor`:

e97b55d322/python/pyspark/ml/wrapper.py (L377)

and `JavaPredictionModel`

e97b55d322/python/pyspark/ml/wrapper.py (L405)

respectively.

This however is inconsistent with Scala counterpart where both  classes extend private `IsotonicRegressionBase`

3cb1b57809/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala (L42-L43)

This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.

```python
from pyspark.ml.regression impor IsotonicRegressionMode
from pyspark.ml.param.shared import HasWeightCol

issubclass(IsotonicRegressionModel, HasWeightCol)
# False

hasattr(model, "weightCol")
# True
```

as well as introduces a bug, by adding unsupported `predict` method:

```python
import inspect

hasattr(model, "predict")
# True

inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})

IsotonicRegressionModel.predict.__doc__
# Predict label for the given features.\n\n        .. versionadded:: 3.0.0'

model.predict(dataset.first().features)

# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...

```

Furthermore existing implementation can cause further problems in the future, if `Predictor` / `PredictionModel` API changes.

### Does this PR introduce any user-facing change?

Yes. It:

- Removes invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightColumn` to `IsotonicRegressionModel`.

however the faulty implementation hasn't been released yet, and proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 18:06:10 -05:00
zero323 df22535bbd [SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _AFTSurvivalRegressionParams
### What changes were proposed in this pull request?

Adds

```python
_AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
                                   HasMaxIter, HasTol, HasFitIntercept,
                                   HasAggregationDepth): ...
```

with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and  `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel,` in `AFTSurvivalRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and  `JavaModel` in `AFTSurvivalRegressionModel` with newly added `JavaPredictor`:

e97b55d322/python/pyspark/ml/wrapper.py (L377)

and `JavaPredictionModel`

e97b55d322/python/pyspark/ml/wrapper.py (L405)

respectively.

This however is inconsistent with Scala counterpart where both classes extend private `AFTSurvivalRegressionBase`

eb037a8180/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L48-L50)

This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p))

```
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param

issubclass(AFTSurvivalRegressionModel, HasMaxIter)
# False
hasattr(model, "maxIter")  and isinstance(model.maxIter, Param)
# True

issubclass(AFTSurvivalRegressionModel, HasTol)
# False
hasattr(model, "tol")  and isinstance(model.tol, Param)
# True
```

and can cause problems in the future, if Predictor / PredictionModel API changes (unlike [`IsotonicRegression`](https://github.com/apache/spark/pull/26023), current implementation is technically speaking correct, though incomplete).

### Does this PR introduce any user-facing change?

Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These change purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally affected API hasn't been released in the current form yet.

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 18:04:21 -05:00
HyukjinKwon 20ee2f5dcb [SPARK-29286][PYTHON][TESTS] Uses UTF-8 with 'replace' on errors at Python testing script
### What changes were proposed in this pull request?

This PR proposes to let Python 2 uses UTF-8, instead of ASCII, with permissively replacing non-UDF-8 unicodes into unicode points in Python testing script.

### Why are the changes needed?

When Python 2 is used to run the Python testing script, with `decode(encoding='ascii')`, it fails whenever non-ascii codes are printed out.

### Does this PR introduce any user-facing change?

To dev, it will enable to support to print out non-ASCII characters.

### How was this patch tested?

Jenkins will test it for our existing test codes. Also, manually tested with UTF-8 output.

Closes #26021 from HyukjinKwon/SPARK-29286.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-04 10:04:28 -07:00
Liang-Chi Hsieh 2bc3fff13b [SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0
### What changes were proposed in this pull request?

This patch upgrades cloudpickle to 1.0.0 version.

Main changes:

1. cleanup unused functions: 936f16fac8
2. Fix relative imports inside function body: 31ecdd6f57
3. Write kw only arguments to pickle: 6cb4718528

### Why are the changes needed?

We should include new bug fix like 6cb4718528, because users might use such python function in PySpark.

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[Stage 0:>                                                        (0 + 12) / 12]19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process
    serializer.dump_stream(out_iter, outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
TypeError: f() missing 1 required keyword-only argument: 'b'
```

After:

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[2, 3, 4]
```

### Does this PR introduce any user-facing change?

Yes. This fixes two bugs when pickling Python functions.

### How was this patch tested?

Existing tests.

Closes #26009 from viirya/upgrade-cloudpickle.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-03 19:20:51 +09:00
zero323 858bf76e35 [SPARK-29142][PYTHON][ML][FOLLOWUP][DOC] Replace incorrect :py:attr: applications
### What changes were proposed in this pull request?

This PR replaces some references with correct ones (`:py:class:`).

### Why are the changes needed?

Newly added mixins from the original PR incorrectly reference classes with `:py:attr:`.  While these classes are marked as internal, and not rendered in the standard documentation, it still makes sense to use correct roles.

### Does this PR introduce any user-facing change?

No. The changed part is not a part of generated PySpark documents.

### How was this patch tested?

Since this PR is a kind of typo fix, manually checking the patch.
We can build document for compilation test although there is no UI change.

Closes #26004 from zero323/SPARK-29142-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 02:45:44 -07:00
Chris Martin 76791b89f5 [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF
Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon.

Specifically:

- Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated.
- Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key.
- Added the boilerplate for doctests to cogroup.py.  Note that cogroup.py only contains the apply() function which has doctests disabled as per the  other Pandas Udfs.
- Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package.
- Some minor  formatting tweaks.

This was tested by running the appropriate unit tests.  I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but it someone can describe how I can do this I'd be happy to check.

Closes #25939 from d80tb7/SPARK-27463-fixes.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-30 22:25:35 +09:00
HyukjinKwon fda0e6e48d [SPARK-29240][PYTHON] Pass Py4J column instance to support PySpark column in element_at function
### What changes were proposed in this pull request?

This PR makes `element_at` in PySpark able to take PySpark `Column` instances.

### Why are the changes needed?

To match with Scala side. Seems it was intended but not working correctly as a bug.

### Does this PR introduce any user-facing change?

Yes. See below:

```python
from pyspark.sql import functions as F
x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num'])
x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show()
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at
    return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction))
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
  File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```

After:

```
+---------+---+---+
|     list|num| aa|
+---------+---+---+
|[1, 2, 3]|  1|  1|
|[4, 5, 6]|  2|  5|
|[7, 8, 9]|  3|  9|
+---------+---+---+
```

### How was this patch tested?

Manually tested against literal, Python native types, and PySpark column.

Closes #25950 from HyukjinKwon/SPARK-29240.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-27 11:04:55 -07:00
zhengruifeng aed7ff36f7 [SPARK-29258][ML][PYSPARK] parity between ml.evaluator and mllib.metrics
### What changes were proposed in this pull request?
1, expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`
2, expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`
3, add metric `explainedVariance` in `RegressionEvaluator`

### Why are the changes needed?
existing function in mllib.metrics should also be exposed in ml

### Does this PR introduce any user-facing change?
yes, this PR add two expert params and one metric option

### How was this patch tested?
existing and added tests

Closes #25940 from zhengruifeng/evaluator_add_param.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-27 13:30:03 +08:00
Huaxin Gao bdc4943b9e [SPARK-29142][PYTHON][ML] Pyspark clustering models support column setters/getters/predict
### What changes were proposed in this pull request?
Add the following Params classes in Pyspark clustering
```GaussianMixtureParams```
```KMeansParams```
```BisectingKMeansParams```
```LDAParams```
```PowerIterationClusteringParams```

### Why are the changes needed?
To be consistent with scala side

### Does this PR introduce any user-facing change?
Yes. Add the following changes:
```
GaussianMixtureModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setProbabilityCol
- get/setTol
- predict
```

```
KMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- get/setTol
- predict
```

```
BisectingKMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- predict
```

```
LDAModel(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval):
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setCheckpointInterval
```

### How was this patch tested?
Add doctests

Closes #25859 from huaxingao/spark-29142.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-27 11:19:02 +08:00
sheepstop 81de9d3c29 [SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component

### Why are the changes needed?
It will throw exception if the value stare is 0, but array indices start at 0 most of times in other scenarios.

### Does this PR introduce any user-facing change?
Yes, more info provided to user.

### How was this patch tested?
No tests added, only doc change.

Closes #25704 from sheepstop/master.

Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-24 18:57:54 +09:00
Xianjin YE 8c8016a152 [SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2
### What changes were proposed in this pull request?

This PR allows non-ascii string as an exception message in Python 2 by explicitly en/decoding in case of `str` in Python 2.

### Why are the changes needed?

Previously PySpark will hang when the `UnicodeDecodeError` occurs and the real exception cannot be passed to the JVM side.

See the reproducer as below:

```python
def f():
    raise Exception("中")
spark = SparkSession.builder.master('local').getOrCreate()
spark.sparkContext.parallelize([1]).map(lambda x: f()).count()
```

### Does this PR introduce any user-facing change?

User may not observe hanging for the similar cases.

### How was this patch tested?

Added a new test and manually checking.

This pr is based on #18324, credits should also go to dataknocker.
To make lint-python happy for python3, it also includes a followup fix for #25814

Closes #25847 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-21 08:09:19 +09:00
Holden Karau 42050c3f4f [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator
### What changes were proposed in this pull request?

This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not.

### Why are the changes needed?

In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3b40 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking.

### Does this PR introduce any user-facing change?

A new param is added to toLocalIterator

### How was this patch tested?

New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`.

I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case.

>
> 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Running timers:
>
> [Stage 32:>                                                         (0 + 1) / 1]
> Results:
>
> Prefetch time:
>
> 100.228110831
>
>
> Regular time:
>
> 188.341721614
>
>
>

Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2019-09-20 09:59:31 -07:00
Huaxin Gao e97b55d322 [SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON
### What changes were proposed in this pull request?

Add some common classes in Python to make it have the same structure as Scala

1. Scala has ClassifierParams/Classifier/ClassificationModel:

```
trait ClassifierParams
    extends PredictorParams with HasRawPredictionCol

abstract class Classifier
    extends Predictor with ClassifierParams {
    def setRawPredictionCol
}

abstract class ClassificationModel
  extends PredictionModel with ClassifierParams {
    def setRawPredictionCol
}

```
This PR makes Python has the following:

```
class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
    pass

class JavaClassifier(JavaPredictor, JavaClassifierParams):
    def setRawPredictionCol

class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams):
    def setRawPredictionCol
```
2. Scala has ProbabilisticClassifierParams/ProbabilisticClassifier/ProbabilisticClassificationModel:
```
trait ProbabilisticClassifierParams
    extends ClassifierParams with HasProbabilityCol with HasThresholds

abstract class ProbabilisticClassifier
    extends Classifier with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}

abstract class ProbabilisticClassificationModel
    extends ClassificationModel with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}
```
This PR makes Python have the following:
```
class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams):
    pass

class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds

class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds
```
3. Scala has PredictorParams/Predictor/PredictionModel:
```
trait PredictorParams extends Params
    with HasLabelCol with HasFeaturesCol with HasPredictionCol

abstract class Predictor
    extends Estimator with PredictorParams {
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol
  }

abstract class PredictionModel
    extends Model with PredictorParams {
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
}
```
This PR makes Python have the following:
```
class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    pass

class JavaPredictor(JavaEstimator, JavaPredictorParams):
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol

class JavaPredictionModel(JavaModel, JavaPredictorParams):
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
```

### Why are the changes needed?
Have parity between Python and Scala ML

### Does this PR introduce any user-facing change?
Yes. Add the following changes:

```
LinearSVCModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict
```

```
LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict
```
```
LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- predict
```

### How was this patch tested?
Add a few doc tests.

Closes #25776 from huaxingao/spark-28985.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-19 08:17:25 -05:00
Huaxin Gao db9e0fda6b [SPARK-22796][PYTHON][ML] Add multiple columns support to PySpark QuantileDiscretizer
### What changes were proposed in this pull request?
Add multiple columns support to PySpark QuantileDiscretizer

### Why are the changes needed?
Multiple columns support for  QuantileDiscretizer was in scala side a while ago. We need to add multiple columns support to python too.

### Does this PR introduce any user-facing change?
Yes. New Python is added

### How was this patch tested?
Add doctest

Closes #25812 from huaxingao/spark-22796.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-09-18 12:16:06 -07:00
Xianjin YE 203bf9e569 [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly
### What changes were proposed in this pull request?
The str of `CapaturedException` is now returned by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also handles unicode under python2 specially.

### Why are the changes needed?
This is an improvement, and makes exception more human readable in python side.

### Does this PR introduce any user-facing change?
Before this pr,  select `中文字段` throws exception something likes below:

```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"
```

after this pr:
```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation

```
### How was this patch
Add a new test to verify unicode are correctly converted and manual checks for thrown exceptions.

This pr's credits should go to uncleGen and is based on https://github.com/apache/spark/pull/17267

Closes #25814 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-18 23:32:10 +09:00