Commit graph

2362 commits

Author SHA1 Message Date
zhengruifeng f1489e6b12 [SPARK-31436][ML] MinHash keyDistance optimization
### What changes were proposed in this pull request?
re-impl `keyDistance`:
if both vectors are dense, new impl is 9.09x faster;
if both vectors are sparse, new impl is 5.66x faster;
if one is dense and the other is sparse, new impl is 7.8x faster;

### Why are the changes needed?
current implementation based on set operations is inefficient

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #28206 from zhengruifeng/minhash_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-17 08:27:21 -05:00
herman fab4ca5156
[SPARK-31450][SQL] Make ExpressionEncoder thread-safe
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects(`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).

### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28223 from hvanhovell/SPARK-31450.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-16 18:47:46 -07:00
zhengruifeng 165590bc29 [SPARK-31301][ML] Flatten the result dataframe of tests in testChiSquare
### What changes were proposed in this pull request?
1, remove newly added method: `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]`, because: 1) it is only used in `ChiSqSelector`; 2, since the returned dataframe of `def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame` only contains one row, after collect it back to driver, the results are similar;
2, add method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame` to return the flatten results;

### Why are the changes needed?
1, when get returned result dataframe, we may want to filter it like `pValue<0.1`, but current returned dataframe is hard to use;
2, current impl need to collect all test results of all columns back to the driver, this is a bottleneck, if we return the flatten datafame, we no longer to collect them to driver;

### Does this PR introduce any user-facing change?
Yes:
1, `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]` removed;
2, the returned dataframe need an action to trigger computation;

### How was this patch tested?
updated testsuites

Closes #28176 from zhengruifeng/flatten_chisq.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-04-14 14:44:54 +08:00
zero323 697fe911ac [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMRegressor`:

- Supporting ` org.apache.spark.ml.r.FMRegressorWrapper`.
- `FMRegressionModel` S4 class.
- Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27571 from zero323/SPARK-30819.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-09 19:38:11 -05:00
zero323 0063462d55 [SPARK-30818][SPARKR][ML] Add SparkR LinearRegression wrapper
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `LinearRegression`

- Supporting `org.apache.spark.ml.rLinearRegressionWrapper`.
- `LinearRegressionModel` S4 class.
- Corresponding `spark.lm` predict, summary and write.ml generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27593 from zero323/SPARK-30818.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-08 22:29:44 -05:00
Holden Karau 8f010bd0a8
[SPARK-31208][CORE] Add an expiremental cleanShuffleDependencies
### What changes were proposed in this pull request?

Add a cleanShuffleDependencies as an experimental developer feature to allow folks to clean up shuffle files more aggressively than we currently do.

### Why are the changes needed?

Dynamic scaling on Kubernetes (introduced in Spark 3) depends on only shutting down executors without shuffle files. However Spark does not aggressively clean up shuffle files (see SPARK-5836) and instead depends on JVM GC on the driver to trigger deletes. We already have a mechanism to explicitly clean up shuffle files from the ALS algorithm where we create a lot of quickly orphaned shuffle files. We should expose this as an advanced developer feature to enable people to better clean-up shuffle files improving dynamic scaling of their jobs on Kubernetes.

### Does this PR introduce any user-facing change?

This adds a new experimental API.

### How was this patch tested?

ALS already used a mechanism like this, re-targets the ALS code to the new interface, tested with existing ALS tests.

Closes #28038 from holdenk/SPARK-31208-allow-users-to-cleanup-shuffle-files.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 13:54:36 -07:00
zero323 0d37f794ef [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMClassifier`:

- Supporting ` org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27570 from zero323/SPARK-30820.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-07 09:01:45 -05:00
zhengruifeng 1dce6c1fd4 [SPARK-31222][ML] Make ANOVATest Sparsity-Aware
### What changes were proposed in this pull request?
when input dataset is sparse, make `ANOVATest` only process non-zero value

### Why are the changes needed?
for performance

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27982 from zhengruifeng/anova_sparse.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-31 10:40:17 +08:00
zhengruifeng 34d6b90449 [SPARK-31283][ML] Simplify ChiSq by adding a common method
### What changes were proposed in this pull request?
add a common method `computeChiSq` and reuse it in both `chiSquaredDenseFeatures` and `chiSquaredSparseFeatures`

### Why are the changes needed?
to simplify ChiSq

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #28045 from zhengruifeng/simplify_chisq.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-30 13:37:56 +08:00
Huaxin Gao d81df56f2d [SPARK-31223][ML] Set seed in np.random to regenerate test data
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-31223
set seed in np.random when generating test data......

### Why are the changes needed?
so the same set of test data can be regenerated later.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
exiting tests

Closes #27994 from huaxingao/spark-31223.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-26 13:53:31 +08:00
zhengruifeng ded51b04d2 [SPARK-31138][ML][FOLLOWUP] ANOVA optimization
### What changes were proposed in this pull request?
1, remove unused var `numFeatures`;
2, remove the computation of `numSamples` and `numClasses`, since they can be directly infered by `counts: OpenHashMap[Double, Long]`

### Why are the changes needed?
remove a unnecessary job to compute `numSamples` and `numClasses`

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27979 from zhengruifeng/anova_followup.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-23 11:16:57 +08:00
Huaxin Gao 307cfe1f8e [SPARK-31185][ML] Implement VarianceThresholdSelector
### What changes were proposed in this pull request?
Implement a Feature selector that removes all low-variance features. Features with a
variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

### Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.
scikit has implemented this selector.
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Add new test suite.

Closes #27954 from huaxingao/variance-threshold.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-22 12:44:18 +08:00
yan ma fae981e5f3 [SPARK-30773][ML] Support NativeBlas for level-1 routines
### What changes were proposed in this pull request?
Change BLAS for part of level-1 routines(axpy, dot, scal(double, denseVector)) from java implementation to NativeBLAS when vector size>256

### Why are the changes needed?
In current ML BLAS.scala, all level-1 routines are fixed to use java
implementation. But NativeBLAS(intel MKL, OpenBLAS) can bring up to 11X
performance improvement based on performance test which apply direct
calls against these methods. We should provide a way to allow user take
advantage of NativeBLAS for level-1 routines. Here we do it through
switching to NativeBLAS for these methods from f2jBLAS.

### Does this PR introduce any user-facing change?
 Yes, methods axpy, dot, scal in level-1 routines will switch to NativeBLAS when it has more than nativeL1Threshold(fixed value 256) elements and will fallback to f2jBLAS if native BLAS is not properly configured in system.

### How was this patch tested?
Perf test direct calls level-1 routines

Closes #27546 from yma11/SPARK-30773.

Lead-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Ma Yan <yan.ma@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:32:58 -05:00
Huaxin Gao 9a990133f6 [SPARK-31138][ML] Add ANOVA Selector for continuous features and categorical labels
### What changes were proposed in this pull request?
Add ANOVA Selector

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.

### Does this PR introduce any user-facing change?
Yes, add a new Selector.

### How was this patch tested?
add new test suites

Closes #27895 from huaxingao/anova.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:28:00 -05:00
Qianyang Yu 6f0b0f1655
[SPARK-30954][ML][R] Make file name the same as class name
This pr solved the same issue as [pr27919](https://github.com/apache/spark/pull/27919), but this one changes the file names based on comment from previous pr.

### What changes were proposed in this pull request?

Make some of  file names the same as class name in R package.

### Why are the changes needed?

Make the file consistence

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

run `./R/run-tests.sh`

Closes #27940 from kevinyu98/spark-30954-r-v2.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 16:15:02 -07:00
zhengruifeng 93088f79cc [SPARK-30776][ML][FOLLOWUP] FValue clean up
### What changes were proposed in this pull request?
remove unused variables;

### Why are the changes needed?
remove unused variables;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27922 from zhengruifeng/test_cleanup.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-17 11:29:08 +08:00
Huaxin Gao 3ce1dff7ba [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930

Remove ML/MLLIB DeveloperApi annotations.

### Why are the changes needed?

The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.

### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.

### How was this patch tested?
existing tests

Closes #27859 from huaxingao/spark-30930.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-16 12:41:22 -05:00
zhengruifeng 7f3c8fa42e [SPARK-31032][ML] GMM compute summary and update distributions in one job
### What changes were proposed in this pull request?
1, compute summary and update distributions in one pass;
2, remove logic related to check `shouldDistributeGaussians`

### Why are the changes needed?
In current impl, GMM need to trigger two jobs at one iteration:
1, one to compute summary;
2, if `shouldDistributeGaussians = ((k - 1.0) / k) * numFeatures > 25.0`, trigger another to update distributions;

`shouldDistributeGaussians` is almost true in practice, since numFeatures is likely to be greater than 25.

We can use only one job to impl above computation, by following the logic in `KMeans`: using `reduceByKey` to compute statistics for each center

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27784 from zhengruifeng/gmm_avoid_distri_gaussian.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-12 11:21:30 +08:00
Huaxin Gao a1a665bece [SPARK-31077][ML] Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
### What changes were proposed in this pull request?

```ChiSqSelector ``` depends on ```mllib.ChiSqSelectorModel``` to do the selection logic. Will remove the dependency in this PR.

### Why are the changes needed?
This PR is an intermediate PR.  Removing ```ChiSqSelector``` dependency on ```mllib.ChiSqSelectorModel```. Next subtask will extract the common code between ```ChiSqSelector``` and ```FValueSelector``` and put in an abstract ```Selector```.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New and existing tests

Closes #27841 from huaxingao/chisq.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-11 13:51:49 -05:00
Huaxin Gao b6b0343e3e [SPARK-30929][ML] ML, GraphX 3.0 QA: API: New Scala APIs, docs
### What changes were proposed in this pull request?
Auditing new ML Scala APIs introduced in 3.0. Fix found issues.

### Why are the changes needed?

### Does this PR introduce any user-facing change?
Yes. Some doc changes

### How was this patch tested?
Existing tests

Closes #27818 from huaxingao/spark-30929.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-09 09:11:21 -05:00
Huaxin Gao 4a64901ab7 [SPARK-31012][ML][PYSPARK][DOCS] Updating ML API docs for 3.0 changes
### What changes were proposed in this pull request?
Updating ML docs for 3.0 changes

### Why are the changes needed?
I am auditing 3.0 ML changes, found some docs are missing or not updated. Need to update these.

### Does this PR introduce any user-facing change?
Yes, doc changes

### How was this patch tested?
Manually build and check

Closes #27762 from huaxingao/spark-doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-07 11:42:05 -06:00
Huaxin Gao 6468d6f103 [SPARK-30776][ML] Support FValueSelector for continuous features and continuous labels
### What changes were proposed in this pull request?
Add FValueRegressionSelector for continuous features and continuous labels.

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.

### Does this PR introduce any user-facing change?
Yes.
Add a new Selector

### How was this patch tested?
Add new tests

Closes #27679 from huaxingao/spark_30776.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-06 13:24:30 +08:00
zhengruifeng 111e9038d8 [SPARK-30770][ML] avoid vector conversion in GMM.transform
### What changes were proposed in this pull request?
Current impl needs to convert ml.Vector to breeze.Vector, which can be skipped.

### Why are the changes needed?
avoid unnecessary vector conversions

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27519 from zhengruifeng/gmm_transform_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 11:02:27 +08:00
Wu, Xiaochang ac122762f5 [SPARK-30813][ML] Fix Matrices.sprand comments
### What changes were proposed in this pull request?
Fix mistakes in comments

### Why are the changes needed?
There are mistakes in comments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27564 from xwu99/fix-mllib-sprand-comment.

Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-02 08:56:17 -06:00
Huaxin Gao 0a5e9a12a7 [SPARK-30995][ML][DOCS] Latex doesn't work correctly in FMClassifier/FMRegressor Scala doc
### What changes were proposed in this pull request?
Latex doesn't work correctly

### Why are the changes needed?
Fix the doc to make Latex work

### Does this PR introduce any user-facing change?
Before fix:

![image](https://user-images.githubusercontent.com/13592258/75611743-0fa00a00-5ad2-11ea-9cc0-4b246b25d63d.png)

![image](https://user-images.githubusercontent.com/13592258/75611755-25adca80-5ad2-11ea-884d-b2792b714bd5.png)

After fix:

![image](https://user-images.githubusercontent.com/13592258/75611776-46762000-5ad2-11ea-838d-f7f6f93c8aec.png)

![image](https://user-images.githubusercontent.com/13592258/75611778-51c94b80-5ad2-11ea-85e4-8c7424268f52.png)

### How was this patch tested?
Manually build doc and test

Closes #27748 from huaxingao/fm_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 10:33:26 +09:00
zhengruifeng 14bb639c55 [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
### What changes were proposed in this pull request?
1, avoid `Iterator.grouped(size: Int)`, which need to maintain an arraybuffer of `size`
2, keep the number of partitions in curve computation

### Why are the changes needed?
1, `BinaryClassificationMetrics` tend to fail (OOM) when `grouping=count/numBins` is too large, due to `Iterator.grouped(size: Int)` need to maintain an arraybuffer with `size` entries, however, in `BinaryClassificationMetrics` we do not need to maintain such a big array;
2, make sizes of partitions more even;

This PR computes metrics more stable and a littler faster;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27682 from zhengruifeng/grouped_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-28 16:55:24 +08:00
Tigran Saluev 6f4a2e4c99 [MINOR][ML] Fix confusing error message in VectorAssembler
### What changes were proposed in this pull request?

When VectorAssembler encounters a NULL with handleInvalid="error", it throws an exception. This exception, though, has a typo making it confusing. Yet apparently, this same exception for NaN values is fine. Fixed it to look like the right one.

### Why are the changes needed?

Encountering this error with such message was very confusing! I hope to save time of fellow engineers by improving it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It's just an error message...

Closes #27709 from Saluev/patch-1.

Authored-by: Tigran Saluev <tigran@saluev.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-27 11:05:53 -06:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
zhengruifeng e086a78706 [MINOR][ML] ML cleanup
### What changes were proposed in this pull request?
1, remove used imports and variables;
2, use `.iterator` instead of `.view` to avoid IDEA warnings;
3, remove resolved _TODO_

### Why are the changes needed?
cleanup

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27600 from zhengruifeng/nits.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-25 12:32:12 -06:00
Sean Owen cc8d356e4f [SPARK-30939][ML] Correctly set output col when StringIndexer.setOutputCols is used
### What changes were proposed in this pull request?

Set the supplied output col name as intended when StringIndexer transforms an input after setOutputCols is used.

### Why are the changes needed?

The output col names are wrong otherwise and downstream pipeline components fail.

### Does this PR introduce any user-facing change?

Yes in the sense that it fixes incorrect behavior, otherwise no.

### How was this patch tested?

Existing tests plus new direct tests of the schema.

Closes #27684 from srowen/SPARK-30939.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-24 20:18:10 -06:00
Huaxin Gao 7aa94ca9cb [SPARK-30867][ML] Add FValueTest
### What changes were proposed in this pull request?
This is the very first PR for supporting continuous distribution features selectors.
It adds the algorithm to compute fvalue for continuous features and continuous labels. This algorithm will be used for FValueRegressionSelector.

### Why are the changes needed?
Current Spark only supports the selection of categorical features, while there are many requirements for the selection of continuous distribution features.

I will add two new selectors:

1. FValueRegressionSelector for continuous features and continuous labels.
2. ANOVAFValueClassificationSelector for continuous features and categorical labels.

I will use subtasks to add these two selectors:

add FValueRegressionSelector on scala side

- add FValueRegressionTest, this contains the algorithm to compute FValue
- add FValueRegressionSelector using the above algorithm
- add a common Selector, make FValueRegressionSelector  and ChisqSelector to extend common selector

add FValueRegressionSelector on python side
add samples and doc
do the same for ANOVAFValueClassificationSelector

### Does this PR introduce any user-facing change?
Yes.
```
/**
 * param dataset  DataFrame of continuous labels and continuous features.
 * param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
 * param labelCol  Name of label column in dataset, of any numerical type
 * return Array containing the SelectionTestResult for every feature against the label.
 */
SelectionTest.fValueRegressionTest(dataset: Dataset[_], featuresCol: String, labelCol: String)
```

### How was this patch tested?
Add Unit test.

Closes #27623 from huaxingao/spark-30867.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-24 11:14:54 +08:00
yi.wu 82ce4753aa [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default
### What changes were proposed in this pull request?

This PR proposes to throw exception by default when user use untyped UDF(a.k.a `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

And user could still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?

According to #23498, since Spark 3.0, the untyped UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will  return 0 in Spark 3.0 but null in Spark 2.4. And the behavior change is introduced due to Spark3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and may cause correctness issue if user still expect `null` in some cases. Thus, we'd better to encourage user to use typed UDF to avoid this problem.

### Does this PR introduce any user-facing change?

Yeah. User will hit exception now when use untyped UDF.

### How was this patch tested?

Added test and updated some tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 14:46:54 +08:00
Huaxin Gao 69367997c3 [SPARK-30802][ML] Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite
### What changes were proposed in this pull request?
There are three changes in this PR:
1. use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suites (similar to https://github.com/apache/spark/pull/26396)
2. Put common code in ```Summarizer.getRegressionSummarizers``` and ```Summarizer.getClassificationSummarizers```.
3. Move ```MultiClassSummarizer``` from ```LogisticRegression``` to ```ml.stat``` (this seems to be a better place since ```MultiClassSummarizer``` is not only used by ```LogisticRegression``` but also several other classes).

### Why are the changes needed?
Minimize code duplication and improve performance

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing test suites.

Closes #27555 from huaxingao/spark-30802.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-19 17:29:52 -06:00
zhengruifeng 0a4080ec3b [SPARK-30736][ML] One-Pass ChiSquareTest
### What changes were proposed in this pull request?
1, distributedly gather matrix `contingency` of each feature
2, distributedly compute the results and then collect them back to the driver

### Why are the changes needed?
existing impl is not efficient:
1, it directly collect matrix `contingency` of partial featues to driver and compute the corresponding result on one pass;
2, a matrix  `contingency` of a featues is of size numDistinctValues X numDistinctLabels, so only 1000 matrices can be collected at a time;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27461 from zhengruifeng/chisq_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-17 09:41:38 -06:00
zhengruifeng 8ebbf85a85 [SPARK-30772][ML][SQL] avoid tuple assignment because it will circumvent the transient tag
### What changes were proposed in this pull request?
it is said in [LeastSquaresAggregator](12e1bbaddb/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala (L188)) that :

> // do not use tuple assignment above because it will circumvent the transient tag

I then check this issue with Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)

### Why are the changes needed?
avoid tuple assignment because it will circumvent the transient tag

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27523 from zhengruifeng/avoid_tuple_assign_to_transient.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-16 10:01:49 -06:00
Liang Zhang 82d0aa37ae [SPARK-30762] Add dtype=float32 support to vector_to_array UDF
### What changes were proposed in this pull request?
In this PR, we add a parameter in the python function vector_to_array(col) that allows converting to a column of arrays of Float (32bits) in scala, which would be mapped to a numpy array of dtype=float32.

### Why are the changes needed?
In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in the memory.

### Does this PR introduce any user-facing change?
Yes.
Old: `vector_to_array()` only take one param
```
df.select(vector_to_array("colA"), ...)
```
New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64")
```
df.select(vector_to_array("colA", "float32"), ...)
```

### How was this patch tested?
Unit test in scala.
doctest in python.

Closes #27522 from liangz1/udf-float32.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
2020-02-13 23:55:13 +08:00
Huaxin Gao a7ae77a8d8 [SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```

### Why are the changes needed?
Add ```HasBlockSize ``` in ALS, so user can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.

### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```

### How was this patch tested?
Manually tested. Also added doctest.

Closes #27501 from huaxingao/spark_30662.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-09 13:14:30 +08:00
zhengruifeng 12e1bbaddb Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]"
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-08 08:46:16 +08:00
zhengruifeng da32d1e6b5 [SPARK-30700][ML] NaiveBayesModel predict optimization
### What changes were proposed in this pull request?
var `negThetaSum` is always used together with `pi`, so we can add them at first

### Why are the changes needed?
only need to add one var `piMinusThetaSum`, instead of `pi` and `negThetaSum`

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27427 from zhengruifeng/nb_predict.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-01 15:19:16 +08:00
zhengruifeng d0c3e9f1f7 [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1, use blocks instead of vectors for performance improvement
2, use Level-2 BLAS
3, move standardization of input vectors outside of gradient computation

### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (30% ~ 102%)

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27396 from zhengruifeng/blockify_lireg.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 21:04:26 -06:00
Huaxin Gao f59685acaa [SPARK-30662][ML][PYSPARK] ALS/MLP extend HasBlockSize
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```

### Why are the changes needed?

Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.

ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```.

### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```

### How was this patch tested?
Manually tested. Also added doctest.

Closes #27389 from huaxingao/spark-30662.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 13:13:10 -06:00
Maxim Gekk a291433ed3 [SPARK-30678][MLLIB][TESTS] Eliminate warnings from deprecated BisectingKMeansModel.computeCost
### What changes were proposed in this pull request?
In the PR, I propose to replace deprecated method `computeCost` of `BisectingKMeansModel` by `summary.trainingCost`.

### Why are the changes needed?
The changes eliminate deprecation warnings:
```
BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]     assert(model.computeCost(dataset) < 0.1)
BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]     assert(model.computeCost(dataset) == summary.trainingCost)
BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]       model.computeCost(dataset)
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `BisectingKMeansSuite` via:
```
./build/sbt "test:testOnly *BisectingKMeansSuite"
```

Closes #27401 from MaxGekk/kmeans-computeCost-warning.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 09:05:14 -08:00
zhengruifeng 073ce12543 [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, use blocks instead of vectors
2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial

### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (40% ~ 92%)

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27374 from zhengruifeng/blockify_lor.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 10:52:07 -06:00
zhengruifeng 96d27274f5 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param `blockSize`;
3, add a new class `InstanceBlock`
4, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS)

### Does this PR introduce any user-facing change?
a new param `blockSize`

### How was this patch tested?
existing and updated testsuites

Closes #27360 from zhengruifeng/blockify_svc.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-28 20:55:21 +08:00
Huaxin Gao 2f8e4d0d6e [SPARK-30630][ML] Remove numTrees in GBT in 3.0.0
### What changes were proposed in this pull request?
Remove ```numTrees``` in GBT in 3.0.0.

### Why are the changes needed?
Currently, GBT has
```
  /**
   * Number of trees in ensemble
   */
  Since("2.0.0")
  val getNumTrees: Int = trees.length
```
and
```
  /** Number of trees in ensemble */
  val numTrees: Int = trees.length
```
I think we should remove one of them. We deprecated it in 2.4.5 via https://github.com/apache/spark/pull/27352.

### Does this PR introduce any user-facing change?
Yes, remove ```numTrees``` in GBT in 3.0.0

### How was this patch tested?
existing tests

Closes #27330 from huaxingao/spark-numTrees.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-24 12:12:46 -08:00
zhengruifeng f35f352096 [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.

### Why are the changes needed?
Current RF with numTrees=1 will directly build a tree using the orignial dataset,

while with numTrees>1 it will use bootstrap samples to build trees.

This design is for training a DecisionTreeModel by the impl of RandomForest, however, it is somewhat strange.

In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.

### Does this PR introduce any user-facing change?
Yes, new param is added

### How was this patch tested?
existing testsuites

Closes #27254 from zhengruifeng/add_bootstrap.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-23 16:44:13 +08:00
zhengruifeng 1c46bd9e60 [SPARK-30503][ML] OnlineLDAOptimizer does not handle persistance correctly
### What changes were proposed in this pull request?
unpersist graph outside checkpointer, like what Pregel does

### Why are the changes needed?
Shown in [SPARK-30503](https://issues.apache.org/jira/browse/SPARK-30503), intermediate edges are not unpersisted

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites and manual test

Closes #27261 from zhengruifeng/lda_checkpointer.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-22 08:24:11 -06:00
Huaxin Gao 92dd7c9d2a [MINOR][ML] Change DecisionTreeClassifier to FMClassifier in OneVsRest setWeightCol test
### What changes were proposed in this pull request?
Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test

### Why are the changes needed?
In ```OneVsRest```, if the classifier doesn't support instance weight, ```OneVsRest``` weightCol will be ignored, so unit test has tested one classifier(```LogisticRegression```) that support instance weight, and one classifier (```DecisionTreeClassifier```) that doesn't support instance weight. Since ```DecisionTreeClassifier``` now supports instance weight, we need to change it to the classifier that doesn't have weight support.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test

Closes #27204 from huaxingao/spark-ovr-minor.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-17 10:04:41 +08:00
Huaxin Gao 1ef1d6caf2 [SPARK-29565][FOLLOWUP] add setInputCol/setOutputCol in OHEModel
### What changes were proposed in this pull request?
add setInputCol/setOutputCol in OHEModel

### Why are the changes needed?
setInputCol/setOutputCol should be in OHEModel too.

### Does this PR introduce any user-facing change?
Yes.
```OHEModel.setInputCol```
```OHEModel.setOutputCol```

### How was this patch tested?
Manually tested.

Closes #27228 from huaxingao/spark-29565.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-16 19:23:10 +08:00
zhengruifeng aec55cd1ca [SPARK-30502][ML][CORE] PeriodicRDDCheckpointer support storageLevel
### What changes were proposed in this pull request?
1, add field `storageLevel` in `PeriodicRDDCheckpointer`
2, for ml.GBT/ml.RF set storageLevel=`StorageLevel.MEMORY_AND_DISK`

### Why are the changes needed?
Intermediate RDDs in ML are cached with storageLevel=StorageLevel.MEMORY_AND_DISK.
PeriodicRDDCheckpointer & PeriodicGraphCheckpointer now store RDD with storageLevel=StorageLevel.MEMORY_ONLY, it maybe nice to set the storageLevel of checkpointer.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27189 from zhengruifeng/checkpointer_storage.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-16 11:01:30 +08:00