## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the Python API.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes #12765 from andrewor14/python-spark-session-more.
## What changes were proposed in this pull request?
Per [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate the APIs of LogisticRegression and LinearRegression that use SGD.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12596 from zhengruifeng/deprecate_sgd.
## What changes were proposed in this pull request?
This PR adds Python APIs for:
- `ContinuousQueryManager`
- `ContinuousQueryException`
The `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.
For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
This PR also attempts to fix test flakiness by stopping all active streams just before tests.
## How was this patch tested?
Python Doc tests and unit tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #12673 from brkyvz/pyspark-cqm.
## What changes were proposed in this pull request?
Support `avgMetrics` in `CrossValidatorModel` in Python.
## How was this patch tested?
Doctest and `test_save_load` in `pyspark/ml/test.py`
[JIRA](https://issues.apache.org/jira/browse/SPARK-12810)
Author: Kai Jiang <jiangkai@gmail.com>
Closes #12464 from vectorijk/spark-12810.
## What changes were proposed in this pull request?
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x101f3bfd0>
>>> spark.sql("SHOW TABLES").show()
...
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|      src|      false|
+---------+-----------+
>>> spark.range(1, 10, 2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
|  7|
|  9|
+---+
```
**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later before 2.0.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes #12746 from andrewor14/python-spark-session.
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.
## How was this patch tested?
Unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12702 from yanboliang/spark-14899.
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
* `RowMatrix` <sup>**[1]**</sup>
    1. `computeGramianMatrix`
    2. `computeCovariance`
    3. `computeColumnSummaryStatistics`
    4. `columnSimilarities`
    5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
    1. `computeGramianMatrix`
* `CoordinateMatrix`
    1. `transpose`
* `BlockMatrix`
    1. `validate`
    2. `cache`
    3. `persist`
    4. `transpose`
**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
## What changes were proposed in this pull request?
Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API. This was added after 1.6, so we can modify this API without breaking APIs.
This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov
This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method
Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12593 from jkbradley/sparkml-gmm-fix.
## What changes were proposed in this pull request?
SPARK-14071 changed MLWritable.write to be a property. This reverts that change since there was not a good way to make MLReadable.read appear to be a property.
## How was this patch tested?
existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12671 from jkbradley/revert-MLWritable-write-py.
## What changes were proposed in this pull request?
We deprecated ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, setting it will have no effect other than producing a warning message. We did not remove ```setRuns/getRuns```, for better binary compatibility.
This PR changes the uses of `runs` that appear in the public API. Usage inside ```KMeans.runAlgorithm()``` will be resolved in #10806.
## How was this patch tested?
Existing unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12608 from yanboliang/spark-11559.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes #12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
As discussed in [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and use it as the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm``` so that users can choose the hash method.
Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.
## How was this patch tested?
unit tests.
cc jkbradley MLnick
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12498 from yanboliang/spark-10574.
## What changes were proposed in this pull request?
Removed instances of JavaMLWriter, JavaMLReader appearing in public Python API docs
## How was this patch tested?
n/a
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12542 from jkbradley/javamlwriter-doc.
## What changes were proposed in this pull request?
Add Python API in ML for GaussianMixture
## How was this patch tested?
Added doctests; the test cases are the same as the mllib Python tests.
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12402 from wangmiao1981/gmm.
## What changes were proposed in this pull request?
Removed the `expectedType` arg from the PySpark `Param` `__init__`, as suggested by the JIRA.
## How was this patch tested?
Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix.
Author: Jason Lee <cjlee@us.ibm.com>
Closes #12581 from jasoncl/SPARK-14768.
## What changes were proposed in this pull request?
In Python, sqlContext.getConf didn't allow getting the system default (getConf with one parameter).
Now the following are supported:
```
sqlContext.getConf(confName) # System default if not locally set, this is new
sqlContext.getConf(confName, myDefault) # myDefault if not locally set, old behavior
```
I also added doctests to this function. The original behavior does not change.
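The two call forms above can be sketched in plain Python with a sentinel default. This is only an illustration of the pattern, not the PySpark implementation: `ConfStore`, `_NO_VALUE`, and the example key and values are hypothetical.

```python
# Hypothetical sketch: a private sentinel distinguishes "no default passed"
# from any real default value the caller might supply.
_NO_VALUE = object()


class ConfStore(object):
    """Toy config holder with system defaults and local overrides."""

    def __init__(self, system_defaults):
        self._system_defaults = dict(system_defaults)
        self._local = {}

    def setConf(self, key, value):
        self._local[key] = value

    def getConf(self, key, defaultValue=_NO_VALUE):
        if key in self._local:
            return self._local[key]
        if defaultValue is not _NO_VALUE:
            # Two-argument form: the caller-supplied default is returned.
            return defaultValue
        # One-argument form: fall back to the system default.
        return self._system_defaults[key]


conf = ConfStore({"spark.sql.shuffle.partitions": "200"})
one_arg = conf.getConf("spark.sql.shuffle.partitions")
two_arg = conf.getConf("spark.sql.shuffle.partitions", "10")
conf.setConf("spark.sql.shuffle.partitions", "50")
after_set = conf.getConf("spark.sql.shuffle.partitions", "10")
```

Using a sentinel rather than `None` keeps `None` available as a legitimate caller-supplied default.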
## How was this patch tested?
Manually, but doctests were added.
Author: mathieu longtin <mathieu.longtin@nuance.com>
Closes #12488 from mathieulongtin/pyfixgetconf3.
## What changes were proposed in this pull request?
In Python, the `option` and `options` methods of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
This is based on #11305 from mathieulongtin.
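A minimal sketch of the idea, assuming a helper that converts option values before they cross to the JVM. The name `to_option_value` and the boolean lower-casing are assumptions made for illustration; the actual helper in the patch may differ.

```python
def to_option_value(value):
    """Hypothetical helper: convert a Python option value for the JVM side.

    None is passed through unchanged so it can become a JVM null;
    booleans are lower-cased (an assumption); everything else is stringified.
    """
    if value is None:
        return None
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)


# Naive stringification is what turns None into the literal text "None".
naive = str(None)
fixed = to_option_value(None)
```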
## How was this patch tested?
Added test to readwriter.py.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>
Closes #12494 from viirya/py-df-none-option.
## What changes were proposed in this pull request?
PySpark deserialization has a bug that shows up when deserializing all-zero sparse vectors. This fix filters out empty string tokens before casting, so properly stringified SparseVectors are parsed successfully.
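To illustrate why empty tokens matter, here is a simplified stand-alone parser, not the PySpark code. It assumes the string form `(size,[indices],[values])`; the function name `parse_sparse` is hypothetical. For an all-zero vector the bracketed lists are empty, so splitting on commas yields a single empty token that must be dropped before casting.

```python
def parse_sparse(s):
    """Parse a string like '(4,[1,3],[1.0,5.5])' into (size, indices, values)."""
    s = s.strip()
    if not (s.startswith("(") and s.endswith(")")):
        raise ValueError("expected '(size,[indices],[values])'")
    body = s[1:-1]
    size_str, rest = body.split(",", 1)
    # rest looks like '[1,3],[1.0,5.5]' or, for an all-zero vector, '[],[]'
    ind_str, val_str = rest.split("],[")
    ind_tokens = ind_str.lstrip("[").split(",")
    val_tokens = val_str.rstrip("]").split(",")
    # ''.split(',') returns [''], so filter out empty tokens before casting.
    indices = [int(t) for t in ind_tokens if t.strip()]
    values = [float(t) for t in val_tokens if t.strip()]
    return int(size_str), indices, values
```

Without the `if t.strip()` filter, `int('')` would raise a `ValueError` on the all-zero case.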
## How was this patch tested?
Standard unit-tests similar to other methods.
Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal>
Author: Arash Parsa <arashpa@gmail.com>
Author: Vishnu Prasad <vishnu667@gmail.com>
Author: Vishnu Prasad S <vishnu667@gmail.com>
Closes #12516 from arashpa/SPARK-14739.
## What changes were proposed in this pull request?
Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
- Iterating a `StructType` will iterate its fields
  - `[field.name for field in my_structtype]`
- Indexing with a string will return a field by name
  - `my_structtype['my_field_name']`
- Indexing with an integer will return a field by position
  - `my_structtype[0]`
- Indexing with a slice will return a new `StructType` with just the chosen fields:
  - `my_structtype[1:3]`
- The length is the number of fields (should also provide "truthiness" for free)
  - `len(my_structtype) == 2`
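The access patterns listed above map onto Python's `__iter__`, `__getitem__`, and `__len__` protocols. The following is a toy sketch of that mapping, not the `pyspark.sql.types` code; `Field` and `Struct` are hypothetical stand-ins.

```python
class Field(object):
    """Hypothetical stand-in for a struct field."""

    def __init__(self, name, data_type):
        self.name = name
        self.data_type = data_type


class Struct(object):
    """Hypothetical stand-in for a struct type holding an ordered field list."""

    def __init__(self, fields):
        self.fields = list(fields)

    def __iter__(self):
        # Iterating the struct iterates its fields.
        return iter(self.fields)

    def __len__(self):
        # Also gives truthiness: an empty struct is falsy.
        return len(self.fields)

    def __getitem__(self, key):
        if isinstance(key, str):
            for field in self.fields:
                if field.name == key:
                    return field
            raise KeyError("No field named {0}".format(key))
        if isinstance(key, int):
            return self.fields[key]
        if isinstance(key, slice):
            # A slice returns a new struct with just the chosen fields.
            return Struct(self.fields[key])
        raise TypeError("key must be a str, int, or slice")


schema = Struct([Field("id", "long"), Field("name", "string"), Field("age", "int")])
names = [f.name for f in schema]
```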
## How was this patch tested?
Extended the unit test coverage in the accompanying `tests.py`.
Author: Sheamus K. Parkes <shea.parkes@milliman.com>
Closes #12251 from skparkes/pyspark-structtype-enhance.
## What changes were proposed in this pull request?
#11663 added type conversion functionality for parameters in PySpark. This PR finds the ```Param``` definitions that were missed and did not pass the corresponding ```TypeConverter``` argument, and fixes them. After this PR, all params in pyspark/ml/ use ```TypeConverter```.
## How was this patch tested?
Existing tests.
cc jkbradley sethah
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12529 from yanboliang/typeConverter.
## What changes were proposed in this pull request?
#11939 made Python param setters use the `_set` method. This PR fixes the ones that were missed.
## How was this patch tested?
Existing tests.
cc jkbradley sethah
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12531 from yanboliang/setters-omissive.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
  - `startStream`
  - `trigger`
  - `queryName`
- `DataFrameReader`
  - `stream`
- `DataFrame`
  - `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow-up.
This PR also contains some very minor doc fixes on the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes #12320 from brkyvz/stream-python.
## What changes were proposed in this pull request?
This issue aims to expose the Scala `bround` function in the Python/R API.
The `bround` function was implemented in SPARK-14614 by extending the current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
  if (Double.isNaN(input) || Double.isInfinite(input)) {
    return input;
  }
  return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```
After this PR, `pyspark` and `sparkR` also support the `bround` function.
**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```
**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```
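For readers without a Spark shell at hand, the HALF_EVEN semantics can be approximated in plain Python with the `decimal` module. This is a sketch that mirrors the Hive snippet above, not the Spark implementation; the local function name `bround` and the use of `repr` to obtain the float's decimal string are choices made for this illustration.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN


def bround(value, scale=0):
    """Round a float to `scale` decimal places using banker's rounding."""
    if math.isnan(value) or math.isinf(value):
        return value
    # Decimal(1).scaleb(-scale) has exponent -scale, e.g. scale=2 -> 0.01.
    quantum = Decimal(1).scaleb(-scale)
    # Go through the float's string form rather than its exact binary value.
    rounded = Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return float(rounded)


examples = [bround(2.5), bround(3.5), bround(0.125, 2)]
```

Ties go to the nearest even digit, so 2.5 rounds down to 2.0 while 3.5 rounds up to 4.0, matching the PySpark and SparkR outputs shown above.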
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12509 from dongjoon-hyun/SPARK-14639.
## What changes were proposed in this pull request?
Change unpersist blocking parameter default value to match Scala
## How was this patch tested?
unit tests, manual tests
jkbradley davies
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #12507 from felixcheung/pyunpersist.
## What changes were proposed in this pull request?
PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.
This PR changes all usages in pyspark/ml/ to keyword args.
## How was this patch tested?
Existing unit tests. I will not test type conversion for every Param unless we really think it necessary.
Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
```
/Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
"Use typeConverter instead, as a keyword argument.")
```
That warning came from the typeConverter argument being passed as the expectedType arg by mistake.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12480 from jkbradley/typeconverter-fix.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14440
Remove
* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader
and modify comments.
## How was this patch tested?
Tested with unit tests.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #12216 from yinxusen/SPARK-14440.
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib
## How was this patch tested?
Added test cases in tests.py under both ML and MLlib
Author: Jason Lee <cjlee@us.ibm.com>
Closes #12428 from jasoncl/SPARK-14564.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14306
Add PySpark OneVsRest save/load support.
## How was this patch tested?
Test with Python unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #12439 from yinxusen/SPARK-14306-0415.
## What changes were proposed in this pull request?
Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.
This PR: Use unicode everywhere in Python.
## How was this patch tested?
Updated persistence unit test to check uid type
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12368 from jkbradley/python-uid-unicode.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-7861
Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.
## How was this patch tested?
Test with doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #12124 from yinxusen/SPARK-14306-7861.
## What changes were proposed in this pull request?
Param setters in Python previously accessed the `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.
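The reason for routing every setter through `_set` can be shown with a stripped-down model. The classes below are a hypothetical sketch (`Toy`, `to_int`, and the simplified `Param`/`Params` are not the `pyspark.ml.param` code): because conversion lives in one place, a setter that writes to `_paramMap` directly would silently skip it.

```python
def to_int(value):
    """Hypothetical converter: accept ints and integral floats only."""
    if isinstance(value, bool):
        raise TypeError("Could not convert %r to int" % (value,))
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("Could not convert %r to int" % (value,))


class Param(object):
    def __init__(self, name, doc, typeConverter=None):
        self.name = name
        self.doc = doc
        # Default to an identity converter when none is given.
        self.typeConverter = typeConverter if typeConverter is not None else (lambda v: v)


class Params(object):
    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # The single place where values are converted and stored.
        for name, value in kwargs.items():
            param = getattr(self, name)
            if value is not None:
                value = param.typeConverter(value)
            self._paramMap[param] = value
        return self


class Toy(Params):
    # The converter is passed by keyword so it cannot land in another slot.
    maxIter = Param("maxIter", "max number of iterations", typeConverter=to_int)

    def setMaxIter(self, value):
        return self._set(maxIter=value)

    def getMaxIter(self):
        return self._paramMap[self.maxIter]


toy = Toy().setMaxIter(10.0)
```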
## How was this patch tested?
Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #11939 from sethah/SPARK-14104.
## What changes were proposed in this pull request?
The default stopwords were a Java object. They are no longer.
## How was this patch tested?
Unit test which failed before the fix
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12422 from jkbradley/pyspark-stopwords.
## What changes were proposed in this pull request?
PySpark ml GBTClassifier, Regressor support export/import.
## How was this patch tested?
Doc test.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12383 from yanboliang/spark-14374.
## What changes were proposed in this pull request?
This fix adds a binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.
Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.
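The effect of the toggle can be sketched with a toy term-frequency hasher. This is not Spark's `HashingTF`: the function name `hashing_tf` is hypothetical, and `zlib.crc32` is used here only as a deterministic stand-in hash.

```python
import zlib


def hashing_tf(terms, num_features=1 << 10, binary=False):
    """Map terms to {bucket index: weight} using the hashing trick."""
    vec = {}
    for term in terms:
        # crc32 is a stand-in hash chosen for determinism, not Spark's choice.
        idx = zlib.crc32(term.encode("utf-8")) % num_features
        # With binary=True every non-zero count collapses to 1.0.
        vec[idx] = 1.0 if binary else vec.get(idx, 0.0) + 1.0
    return vec


doc = ["a", "b", "a", "a", "c"]
counts = hashing_tf(doc)
flags = hashing_tf(doc, binary=True)
```

Both calls touch the same buckets; only the stored weights differ.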
## How was this patch tested?
This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #12079 from yongtang/SPARK-14238.
Added binary toggle param to CountVectorizer feature transformer in PySpark.
Created a unit test for using CountVectorizer with the binary toggle on.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
## What changes were proposed in this pull request?
The PyDoc Makefile used "=" rather than "?=" for setting env variables, so it overwrote the user's values. As a result, the environment variables we set for linting were ignored, which let warnings through. This PR also fixes the warnings that had been introduced.
## How was this patch tested?
manual local export & make
Author: Holden Karau <holden@us.ibm.com>
Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params, and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure by defining the Java wrapper in a plain base class along with methods to make Java calls. It also renames Java wrapper classes to better reflect their purpose.
Ran existing Python ml tests and generated documentation to test this change.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
## What changes were proposed in this pull request?
- updated `OFF_HEAP` semantics for `StorageLevels.java`
- updated `OFF_HEAP` semantics for `storagelevel.py`
## How was this patch tested?
no need to test
Author: Liwei Lin <lwlin7@gmail.com>
Closes #12126 from lw-lin/storagelevel.py.
## What changes were proposed in this pull request?
Python API for GeneralizedLinearRegression
JIRA: https://issues.apache.org/jira/browse/SPARK-13597
## How was this patch tested?
The patch is tested with Python doctest.
Author: Kai Jiang <jiangkai@gmail.com>
Closes #11468 from vectorijk/spark-13597.
## What changes were proposed in this pull request?
Eagerly clean up PySpark's temporary parallelize files rather than waiting for shutdown.
## How was this patch tested?
Unit tests
Author: Holden Karau <holden@us.ibm.com>
Closes #12233 from holdenk/SPARK-13687-cleanup-pyspark-temporary-files.
## What changes were proposed in this pull request?
Cleanups to documentation. No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: for coefficientStandardErrors, document that the intercept stderr comes last. Same for t,p-values
* approxCountDistinct: Document meaning of “rsd” argument.
* LDA: note which params are for online LDA only
## How was this patch tested?
Doc build
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?
A new column, VarianceCol, has been added to DecisionTreeRegressor in the ML Scala code.
This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.
## How was this patch tested?
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (12s)
Finished test(python2.7): pyspark.ml.clustering (18s)
Finished test(python2.7): pyspark.ml.classification (30s)
Finished test(python2.7): pyspark.ml.recommendation (28s)
Finished test(python2.7): pyspark.ml.feature (43s)
Finished test(python2.7): pyspark.ml.regression (31s)
Finished test(python2.7): pyspark.ml.tuning (19s)
Finished test(python2.7): pyspark.ml.tests (34s)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12116 from wangmiao1981/fix_api.
## What changes were proposed in this pull request?
Support `RandomForest{Classifier, Regressor}` save/load in the Python API.
[JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
## How was this patch tested?
doctest
Author: Kai Jiang <jiangkai@gmail.com>
Closes #12238 from vectorijk/spark-14373.
## What changes were proposed in this pull request?
Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
## How was this patch tested?
Added unit tests to exercise the API calls for the summary classes. Also manually verified that the values are as expected and match those from Scala directly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13786
Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
## How was this patch tested?
Test with Python doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #12020 from yinxusen/SPARK-13786.
## What changes were proposed in this pull request?
Currently, Broadcast.unpersist() will remove the file of the broadcast, which should be the behavior of destroy().
This PR adds destroy() for Broadcast in Python, to match the semantics in Scala.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes #12189 from davies/py_unpersist.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the Python, and SQL, API for this function.
With this PR, SQL, Java, and Scala will share the same APIs; that is, users can use:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python, users can access all the APIs above, and in addition they can call:
- `window(timeColumn, windowDuration, startTime=...)`
That is, they can provide `startTime` without providing `slideDuration`. In this case, we will generate tumbling windows.
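How the optional arguments interact can be sketched with integer timestamps. The function below is a hypothetical illustration, not Spark's `window` implementation: it assumes half-open windows `[start, end)`, durations given as plain integers (for example seconds), and a slide that defaults to the window duration, which yields tumbling windows.

```python
def windows_for(ts, window, slide=None, start_time=0):
    """Return the [start, end) windows that contain timestamp `ts`."""
    if slide is None:
        # No slide given: windows do not overlap (tumbling).
        slide = window
    # Latest window start at or before ts, aligned to start_time.
    last_start = ts - ((ts - start_time) % slide)
    result = []
    start = last_start
    # Walk back while the window beginning at `start` still covers ts.
    while start > ts - window:
        result.append((start, start + window))
        start -= slide
    return sorted(result)


tumbling = windows_for(12, 10)
shifted = windows_for(12, 10, start_time=5)
sliding = windows_for(12, 10, slide=5)
```

Passing only `start_time` shifts the tumbling grid without introducing overlap, which is the keyword-only convenience described for Python.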
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #12136 from brkyvz/python-windows.
## What changes were proposed in this pull request?
This fix addresses the issue in PySpark where `spark.python.worker.memory`
could only be configured with a lower case unit (`k`, `m`, `g`, `t`). This fix
allows the upper case unit (`K`, `M`, `G`, `T`) to be used as well, to
conform to the JVM memory string format specified in the documentation.
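A minimal sketch of case-insensitive unit handling, assuming the value is a number followed by a single unit letter and that the result is expressed in megabytes. The function name `parse_memory_mb` is hypothetical and the actual PySpark helper may differ in detail.

```python
def parse_memory_mb(s):
    """Convert a JVM-style memory string such as '512m' or '2G' to megabytes."""
    units = {"k": 1.0 / 1024, "m": 1, "g": 1024, "t": 1 << 20}
    # Lower-casing the suffix is what makes 'M' and 'm' equivalent.
    unit = s[-1].lower()
    if unit not in units:
        raise ValueError("invalid format: " + s)
    return int(float(s[:-1]) * units[unit])


sizes = [parse_memory_mb(v) for v in ("512m", "512M", "2G", "1024K")]
```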
## How was this patch tested?
This fix adds an additional test to cover the changes.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #12163 from yongtang/SPARK-14368.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).
I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).
Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request?
RDD.toLocalIterator() can be used to fetch one partition at a time to reduce memory usage. Right now, for Dataset/DataFrame we have to use df.rdd.toLocalIterator, which is very slow and also requires lots of memory (because of the Java serializer or even the Kryo serializer).
This PR introduces an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 million rows, `df.rdd.toLocalIterator` took about 100 seconds, but `df.toLocalIterator` took less than 7 seconds. For 10 million rows, `rdd.toLocalIterator` crashes (not enough memory) with a 4G heap, but `df.toLocalIterator` finishes in 12 seconds.
The JDBC server has been updated to use DataFrame.toLocalIterator.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #12114 from davies/local_iterator.