### What changes were proposed in this pull request?
current problems:
```
mlp = MultilayerPerceptronClassifier(layers=[2, 2, 2], seed=123)
model = mlp.fit(df)
path = tempfile.mkdtemp()
model_path = path + "/mlp"
model.save(model_path)
model2 = MultilayerPerceptronClassificationModel.load(model_path)
self.assertEqual(model2.getSolver(), "l-bfgs") # this fails because model2.getSolver() returns 'auto'
model2.transform(df)
# this fails with Exception pyspark.sql.utils.IllegalArgumentException: MultilayerPerceptronClassifier_dec859ed24ec parameter solver given invalid value auto.
```
FMClassifier/Regression and GeneralizedLinearRegression have the same problems.
Here are the root causes of the problems:
1. In HasSolver, both Scala and Python default solver to 'auto'
2. On Scala side, mlp overrides the default of solver to 'l-bfgs', FMClassifier/Regression overrides the default of solver to 'adamW', and glr overrides the default of solver to 'irls'
3. On Scala side, mlp overrides the default of solver in MultilayerPerceptronClassificationParams, so both MultilayerPerceptronClassification and MultilayerPerceptronClassificationModel have 'l-bfgs' as default
4. On Python side, mlp overrides the default of solver in MultilayerPerceptronClassification, so it has default as 'l-bfgs', but MultilayerPerceptronClassificationModel doesn't override the default so it gets the default from HasSolver which is 'auto'. In theory, we don't care about the solver value or any other params values for MultilayerPerceptronClassificationModel, because we have the fitted model already. That's why on Python side, we never set default values for any of the XXXModel.
5. when calling getSolver on the loaded mlp model, it calls this line of code underneath:
```
def _transfer_params_from_java(self):
    """
    Transforms the embedded params from the companion Java object.
    """
    ......
    # SPARK-14931: Only check set params back to avoid default params mismatch.
    if self._java_obj.isSet(java_param):
        value = _java2py(sc, self._java_obj.getOrDefault(java_param))
        self._set(**{param.name: value})
    ......
```
that's why model2.getSolver() returns 'auto'. The code doesn't get the default Scala value (in this case 'l-bfgs') to set to Python param, so it takes the default value (in this case 'auto') on Python side.
6. when calling model2.transform(df), it calls this underneath:
```
def _transfer_params_to_java(self):
    """
    Transforms the embedded params to the companion Java object.
    """
    ......
    if self.hasDefault(param):
        pair = self._make_java_param_pair(param, self._defaultParamMap[param])
        pair_defaults.append(pair)
    ......
```
Again, it gets the Python default solver, which is 'auto', and this causes the exception.
7. Currently, on Scala side, for some of the algorithms, we set default values in the XXXParam, so both estimator and transformer get the default value. However, for some of the algorithms, we only set default in estimators, and the XXXModel doesn't get the default value. On Python side, we never set defaults for the XXXModel. This causes the default value inconsistency.
8. My proposed solution: set default params in XXXParam for both Scala and Python, so both the estimator and transformer have the same default value for both Scala and Python. I currently only changed solver in this PR. If everyone is OK with the fix, I will change all the other params as well.
I hope my explanation makes sense to you folks :)
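The mismatch described in items 4 to 6 can be reproduced without Spark. The following is a minimal sketch, not the PySpark implementation: `Params`, `ScalaLikeModel`, `PythonLikeModel` and `transfer_params_from_java` are hypothetical stand-ins that only mimic the "copy back explicitly set params" behavior quoted above.

```python
class Params:
    """Toy param container: explicitly set values plus defaults."""

    def __init__(self):
        self._defaults = {}
        self._values = {}

    def set_default(self, **kwargs):
        self._defaults.update(kwargs)

    def set(self, **kwargs):
        self._values.update(kwargs)

    def is_set(self, name):
        return name in self._values

    def get_or_default(self, name):
        return self._values.get(name, self._defaults[name])


class HasSolver(Params):
    def __init__(self):
        super().__init__()
        self.set_default(solver="auto")  # shared default on both sides


class ScalaLikeModel(HasSolver):
    """Stand-in for the Scala model: the shared params trait overrides the default."""

    def __init__(self):
        super().__init__()
        self.set_default(solver="l-bfgs")


class PythonLikeModel(HasSolver):
    """Stand-in for the Python model: no override, so it keeps 'auto'."""


def transfer_params_from_java(java_obj, py_obj, names=("solver",)):
    # Only explicitly set params are copied back; defaults are not.
    for name in names:
        if java_obj.is_set(name):
            py_obj.set(**{name: java_obj.get_or_default(name)})


java_model = ScalaLikeModel()
py_model = PythonLikeModel()
transfer_params_from_java(java_model, py_model)
print(java_model.get_or_default("solver"))  # l-bfgs
print(py_model.get_or_default("solver"))    # auto
```

Giving `PythonLikeModel` the same default override as `ScalaLikeModel` makes both sides agree, which is the idea behind the proposed fix.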
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing and new tests
Closes #29060 from huaxingao/solver_parity.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, GeneralizedLinearRegressionSummary computes several statistics in a single pass
2, LinearRegressionSummary uses metrics.count
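As a rough illustration of the single-pass idea (not the actual summary code), several statistics can share one loop over the data instead of triggering one pass each. The function name and the chosen statistics below are assumptions for the sketch.

```python
def summarize_single_pass(rows):
    """Accumulate several statistics in one loop over (label, prediction, weight)."""
    count = 0
    weight_sum = 0.0
    weighted_label_sum = 0.0
    weighted_sq_err_sum = 0.0
    for label, prediction, weight in rows:
        count += 1
        weight_sum += weight
        weighted_label_sum += weight * label
        weighted_sq_err_sum += weight * (label - prediction) ** 2
    return {
        "count": count,
        "weightSum": weight_sum,
        "labelMean": weighted_label_sum / weight_sum,
        "mse": weighted_sq_err_sum / weight_sum,
    }


# Made-up rows: (label, prediction, weight)
stats = summarize_single_pass([(1.0, 1.0, 1.0), (3.0, 2.0, 1.0)])
print(stats)
```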
### Why are the changes needed?
avoid extra passes on the dataset
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #28990 from zhengruifeng/glr_summary_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Add summary to RandomForestClassificationModel...
### Why are the changes needed?
so user can get a summary of this classification model, and retrieve common metrics such as accuracy, weightedTruePositiveRate, roc (for binary), pr curves (for binary), etc.
### Does this PR introduce _any_ user-facing change?
Yes
```
RandomForestClassificationModel.summary
RandomForestClassificationModel.evaluate
```
### How was this patch tested?
Add new tests
Closes #28913 from huaxingao/rf_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add training summary for LinearSVCModel......
### Why are the changes needed?
so that user can get the training process status, such as loss value of each iteration and total iteration number.
### Does this PR introduce _any_ user-facing change?
Yes
```LinearSVCModel.summary```
```LinearSVCModel.evaluate```
### How was this patch tested?
new tests
Closes #28884 from huaxingao/svc_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Adding support to Association Rules in Spark ml.fpm.
### Why are the changes needed?
Support is an indication of how frequently the itemset of an association rule appears in the database and suggests if the rules are generally applicable to the dataset. Refer to [wiki](https://en.wikipedia.org/wiki/Association_rule_learning#Support) for more details.
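For readers unfamiliar with the measure, here is a small sketch of how support can be computed for a rule's itemset (antecedent plus consequent). The transactions are made up and `support` is a hypothetical helper, not the ml.fpm API.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
# Rule a => b: the itemset is the union of antecedent and consequent.
rule_support = support(transactions, {"a", "b"})
print(rule_support)  # 0.5
```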
### Does this PR introduce _any_ user-facing change?
Yes. Association rules now have a support measure
### How was this patch tested?
existing and new unit test
Closes #28903 from huaxingao/fpm.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add a generic ClassificationSummary trait
### Why are the changes needed?
Add a generic ClassificationSummary trait so all the classification models can use it to implement summary.
Currently in classification, we only have summary implemented in ```LogisticRegression```. There are requests to implement summary for ```LinearSVCModel``` in https://issues.apache.org/jira/browse/SPARK-20249 and to implement summary for ```RandomForestClassificationModel``` in https://issues.apache.org/jira/browse/SPARK-23631. If we add a generic ClassificationSummary trait and put all the common code there, we can easily add summary to ```LinearSVCModel``` and ```RandomForestClassificationModel```, and also add summary to all the other classification models.
We can use the same approach to add a generic RegressionSummary trait to regression package and implement summary for all the regression models.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
existing tests
Closes #28710 from huaxingao/summary_trait.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds user-specified fold column support to `CrossValidator`. User can assign fold numbers to dataset instead of letting Spark do random splits.
### Why are the changes needed?
This gives `CrossValidator` users more flexibility in splitting folds.
### Does this PR introduce _any_ user-facing change?
Yes, a new `foldCol` param is added to `CrossValidator`. User can use it to specify custom fold splitting.
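Conceptually, a user-specified fold column replaces the random split with a lookup. The sketch below shows that idea on plain Python rows; `split_by_fold` and the row layout are assumptions for illustration and do not reflect how `CrossValidator` is implemented.

```python
def split_by_fold(rows, num_folds, fold_of):
    """Build one (training, validation) pair per fold from user-assigned fold numbers."""
    splits = []
    for fold in range(num_folds):
        validation = [r for r in rows if fold_of(r) == fold]
        training = [r for r in rows if fold_of(r) != fold]
        splits.append((training, validation))
    return splits


rows = [
    {"x": 1.0, "fold": 0},
    {"x": 2.0, "fold": 1},
    {"x": 3.0, "fold": 0},
    {"x": 4.0, "fold": 2},
]
splits = split_by_fold(rows, 3, lambda r: r["fold"])
for training, validation in splits:
    print(len(training), len(validation))
```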
### How was this patch tested?
Added unit tests.
Closes #28704 from viirya/SPARK-31777.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
In LogisticRegression and LinearRegression, if maxIter=n is set, model.summary.totalIterations returns n+1 when the training procedure does not stop early. This is because we use `objectiveHistory.length` as totalIterations, but `objectiveHistory` also contains the initial state, so `objectiveHistory.length` is 1 larger than the number of training iterations.
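The off-by-one can be seen in any optimizer that records the objective before the first step. A minimal sketch, using a hypothetical gradient-descent loop rather than Spark's optimizers:

```python
def minimize(f, grad, x0, step, max_iter):
    """Plain gradient descent that records the objective at every state."""
    x = x0
    objective_history = [f(x)]  # initial state, recorded before any iteration
    for _ in range(max_iter):
        x = x - step * grad(x)
        objective_history.append(f(x))
    return x, objective_history


_, history = minimize(
    f=lambda x: (x - 3.0) ** 2,
    grad=lambda x: 2.0 * (x - 3.0),
    x0=0.0,
    step=0.1,
    max_iter=5,
)
total_iterations = len(history) - 1  # exclude the initial state
print(len(history), total_iterations)  # 6 5
```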
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new tests and also modify existing tests
Closes #28786 from huaxingao/summary_iter.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add instance weight support in LinearRegressionSummary
### Why are the changes needed?
LinearRegression and RegressionMetrics support instance weight. We should support instance weight in LinearRegressionSummary too.
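To illustrate what instance weights change in a summary metric, here is a sketch of a weighted root mean squared error. `weighted_rmse` is a hypothetical helper and the numbers are made up; it is not the LinearRegressionSummary code.

```python
import math


def weighted_rmse(labels, predictions, weights):
    """RMSE where each squared error counts proportionally to its instance weight."""
    weighted_sq_err = sum(
        w * (y - p) ** 2 for y, p, w in zip(labels, predictions, weights)
    )
    return math.sqrt(weighted_sq_err / sum(weights))


labels = [1.0, 2.0]
predictions = [1.0, 4.0]
print(weighted_rmse(labels, predictions, [1.0, 1.0]))  # unweighted case
print(weighted_rmse(labels, predictions, [3.0, 1.0]))  # 1.0
```

With weights `[3.0, 1.0]` the correctly predicted row counts three times, so the error drops compared to the unweighted case.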
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new test
Closes #28772 from huaxingao/lir_weight_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add instance weight support in LogisticRegressionSummary
### Why are the changes needed?
LogisticRegression, MulticlassClassificationEvaluator and BinaryClassificationEvaluator support instance weight. We should support instance weight in LogisticRegressionSummary too.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new tests
Closes #28657 from huaxingao/weighted_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the algorithms that support instance weight, add checks to make sure instance weight is not negative.
### Why are the changes needed?
instance weight has to be >= 0.0
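A check of this kind could look like the following sketch. The function name and the error message are assumptions; the only rule taken from this description is that the weight must be >= 0.0.

```python
def check_non_negative_weight(weight):
    """Return the weight unchanged, or raise if it is not >= 0.0."""
    # `not weight >= 0.0` also rejects NaN, because every comparison with NaN is False.
    if not weight >= 0.0:
        raise ValueError(f"illegal weight value: {weight}. weight must be >= 0.0.")
    return weight


print(check_non_negative_weight(0.5))
```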
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested
Closes #28621 from huaxingao/weight_check.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add weight support in ClusteringEvaluator
### Why are the changes needed?
Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.
### Does this PR introduce _any_ user-facing change?
Yes.
ClusteringEvaluator.setWeightCol
### How was this patch tested?
add new unit test
Closes #28553 from huaxingao/weight_evaluator.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add getMetrics in Evaluators to get the corresponding Metrics instance, so users can use it to get any of the metrics scores. For example:
```
val trainer = new LinearRegression
val model = trainer.fit(dataset)
val predictions = model.transform(dataset)
val evaluator = new RegressionEvaluator()
val metrics = evaluator.getMetrics(predictions)
val rmse = metrics.rootMeanSquaredError
val r2 = metrics.r2
val mae = metrics.meanAbsoluteError
val variance = metrics.explainedVariance
```
### Why are the changes needed?
Currently, Evaluator.evaluate only gives access to one metric, but most users may need to get multiple metrics. This PR adds getMetrics in all the Evaluators, so users can use it to get an instance of the corresponding Metrics to get any of the metrics they want.
### Does this PR introduce _any_ user-facing change?
Yes. Add getMetrics in Evaluators.
For example:
```
/**
 * Get a RegressionMetrics, which can be used to get any of the regression
 * metrics such as rootMeanSquaredError, meanSquaredError, etc.
 *
 * @param dataset a dataset that contains labels/observations and predictions.
 * @return RegressionMetrics
 */
@Since("3.1.0")
def getMetrics(dataset: Dataset[_]): RegressionMetrics
```
### How was this patch tested?
Add new unit tests
Closes #28590 from huaxingao/getMetrics.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0
```
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0
  }
}
```
### Why are the changes needed?
Fix bug.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
#### Manually test:
~~~scala
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")
import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0)
val model = qd.fit(df1) // will raise error in spark master.
~~~
### Explain
In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, then they must have the same hash code.
`array.distinct` relies on `elem.hashCode`, which leads to this error.
Test code on distinct
```
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```
Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
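The same effect and the fix can be sketched in Python. Python's own float hash treats both zeros alike, so the sketch keys on the IEEE 754 bit pattern to mimic a hash-based distinct that tells them apart. `double_bits`, `distinct_by_bits` and `normalize_zeros` are hypothetical helpers, not Spark code.

```python
import struct


def double_bits(x):
    """IEEE 754 bit pattern of a float as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def distinct_by_bits(values):
    """Distinct keyed on the bit pattern, so 0.0 and -0.0 count as different."""
    seen = {}
    for v in values:
        seen.setdefault(double_bits(v), v)
    return list(seen.values())


def normalize_zeros(values):
    # -0.0 == 0.0 is True, so both zeros are mapped to +0.0 here.
    return [0.0 if v == 0.0 else v for v in values]


splits = [-0.5, -0.0, 0.0, 0.5]
print(distinct_by_bits(splits))                   # both zeros survive
print(distinct_by_bits(normalize_zeros(splits)))  # one zero left
```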
Closes #28498 from WeichenXu123/SPARK-31676.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Expose hashFunc property in HashingTF
Some third-party libraries, such as MLeap, need to access it.
See background description here:
https://github.com/combust/mleap/pull/665#issuecomment-621258623
### Why are the changes needed?
See https://github.com/combust/mleap/pull/665#issuecomment-621258623
### Does this PR introduce any user-facing change?
No. Only add a package private constructor.
### How was this patch tested?
N/A
Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, if blockSize==1, keep original behavior, code path trainOnRows;
3, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path trainOnBlocks
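The stacking step can be pictured as grouping consecutive input vectors into small row-major matrices, so that later computation works on one block at a time instead of one row at a time. A minimal sketch with plain lists (standardization omitted; `stack_into_blocks` and `block_dot` are hypothetical names):

```python
def stack_into_blocks(vectors, block_size):
    """Group consecutive vectors into blocks of at most block_size rows."""
    return [vectors[i:i + block_size] for i in range(0, len(vectors), block_size)]


def block_dot(block, coefficients):
    """One matrix-vector product per block instead of one dot product per row."""
    return [sum(x * c for x, c in zip(row, coefficients)) for row in block]


vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
blocks = stack_into_blocks(vectors, block_size=2)
print([len(b) for b in blocks])          # [2, 2, 1]
print(block_dot(blocks[0], [1.0, 1.0]))  # [3.0, 7.0]
```

With `block_size=1` every block holds a single row, which corresponds to the original row-by-row path.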
### Why are the changes needed?
performance gain on dense dataset HIGGS:
1, save about 45% RAM;
2, 3X faster with openBLAS
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
added testsuites
Closes #27473 from zhengruifeng/blockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However, the error message of this exception lists the wrong columns. This PR changes the message so that it names only the columns that lack size metadata.
### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.
```
// create a df without vector size
val df = Seq(
(Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")
// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
.setInputCol("n1")
.setSize(1)
.transform(df)
// assemble n1, n2
val output = new VectorAssembler()
.setInputCols(Array("n1", "n2"))
.setOutputCol("features")
.setHandleInvalid("keep")
.transform(hintedDf)
// because only n1 has vector size, the error message should tell us to set vector size for n2 too
output.show()
```
Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```
Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```
This makes the exception hard to resolve, because it is not clear which column requires a VectorSizeHint. This is especially troublesome when there is a large number of columns to deal with.
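The intended behavior boils down to reporting only the columns without known sizes. A sketch of that filtering, assuming a hypothetical `known_sizes` mapping from column name to vector size (this is not the VectorAssembler code):

```python
def columns_missing_size(input_cols, known_sizes):
    """Columns whose vector size is not known from metadata."""
    return [c for c in input_cols if c not in known_sizes]


def error_message(input_cols, known_sizes):
    missing = columns_missing_size(input_cols, known_sizes)
    return (
        'Can not infer column lengths with handleInvalid = "keep". '
        "Consider using VectorSizeHint to add metadata for columns: "
        "[" + ", ".join(missing) + "]."
    )


# Only n1 carries a size hint, so only n2 should be reported.
message = error_message(["n1", "n2"], {"n1": 1})
print(message)
```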
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.
Closes #28487 from fan31415/SPARK-31671.
Lead-authored-by: fan31415 <fan12356789@gmail.com>
Co-authored-by: yijiefan <fanyije@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;
### Why are the changes needed?
it will obtain performance gain on dense datasets, such as epsilon
1, reduce RAM to persist training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (~10X speedup)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28473 from zhengruifeng/blockify_aft.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ANOVASelector and FValueSelector to PySpark
### Why are the changes needed?
ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well.
### Does this PR introduce _any_ user-facing change?
Yes. Add Python version of ANOVASelector and FValueSelector
### How was this patch tested?
new doctest
Closes #28464 from huaxingao/selector_py.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;
### Why are the changes needed?
it will obtain performance gain on dense datasets, such as `epsilon`
1, reduce RAM to persist training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (up to 6X(squaredError)~12X(huber) speedup)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28471 from zhengruifeng/blockify_lir_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`
### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce RAM to persist training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28458 from zhengruifeng/blockify_lor_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement abstract Selector. Put the common code among ```ANOVASelector```, ```ChiSqSelector```, ```FValueSelector``` and ```VarianceThresholdSelector``` to Selector.
### Why are the changes needed?
code reuse
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes #27978 from huaxingao/spark-31127.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;
### Why are the changes needed?
1, reduce RAM to persist training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)
### Does this PR introduce any user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes #28349 from zhengruifeng/blockify_svc_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, make AFT reuse common functions in `ml.optim`, rather than making its own impl.
### Why are the changes needed?
The logic for optimizing AFT is quite similar to that of other algorithms based on `RDDLossFunction`,
so we should reuse the common functions.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #28404 from zhengruifeng/mv_aft_optim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add class info output in org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinguish Vectors in ml and mllib
2. Add unit test
### Why are the changes needed?
The catalogString doesn't distinguish Vectors in ml and mllib when an mllib vector is misused in ml
https://issues.apache.org/jira/browse/SPARK-31400
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test is added
Closes #28347 from TJX2014/master-catalogString-distinguish-Vectors-in-ml-and-mllib.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
apply Lemma 1 in [Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf):
> Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
It can be directly applied in EuclideanDistance, but not in CosineDistance.
However, for CosineDistance we can luckily get a variant in the space of radian/angle.
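For the Euclidean case, Lemma 1 lets the nearest-center search skip a distance computation whenever the current best center is far enough from a candidate. The following is a sketch of that pruning rule only, with hypothetical names and made-up centers; it is not the KMeans implementation and does not cover the cosine variant.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_center(point, centers, center_dists):
    """Return (best index, best distance, number of skipped distance computations)."""
    best = 0
    best_dist = dist(point, centers[0])
    skipped = 0
    for c in range(1, len(centers)):
        # Lemma 1: if d(best, c) >= 2 d(x, best) then d(x, c) >= d(x, best),
        # so center c cannot be strictly closer and needs no distance computation.
        if center_dists[best][c] >= 2.0 * best_dist:
            skipped += 1
            continue
        d = dist(point, centers[c])
        if d < best_dist:
            best, best_dist = c, d
    return best, best_dist, skipped


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
# Pairwise center distances are computed once and reused for every point.
center_dists = [[dist(a, b) for b in centers] for a in centers]
best, best_dist, skipped = closest_center((1.0, 1.0), centers, center_dists)
print(best, skipped)  # 0 2
```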
### Why are the changes needed?
It helps improve the performance of prediction and training (mostly)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27758 from zhengruifeng/km_triangle.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add a new method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`
### Why are the changes needed?
Similar to new `test` method in `ChiSquareTest`, it will:
1, support df operation on the returned df;
2, make driver no longer a bottleneck with large numFeatures
### Does this PR introduce any user-facing change?
Yes, new method added
### How was this patch tested?
existing testsuites
Closes #28270 from zhengruifeng/flatten_anova.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a new method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`
### Why are the changes needed?
Similar to new test method in ChiSquareTest, it will:
1, support df operation on the returned df;
2, make driver no longer a bottleneck with large `numFeatures`
### Does this PR introduce any user-facing change?
Yes, add a new method
### How was this patch tested?
existing testsuites
Closes #28268 from zhengruifeng/flatten_fvalue.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
re-impl `keyDistance`:
if both vectors are dense, new impl is 9.09x faster;
if both vectors are sparse, new impl is 5.66x faster;
if one is dense and the other is sparse, new impl is 7.8x faster;
### Why are the changes needed?
current implementation based on set operations is inefficient
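One way to avoid set operations when both inputs are sorted index arrays is a single merge pass. The sketch below computes a Jaccard distance that way; it is an assumption about the general approach, not the actual new `keyDistance` code, and `jaccard_distance_sorted` is a hypothetical name.

```python
def jaccard_distance_sorted(xs, ys):
    """Jaccard distance between two strictly increasing index lists via one merge pass."""
    i = j = 0
    intersection = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            intersection += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    union = len(xs) + len(ys) - intersection
    if union == 0:
        raise ValueError("both index lists are empty")
    return 1.0 - intersection / union


# Indices of the non-zero entries of two made-up vectors.
print(jaccard_distance_sorted([0, 2, 5], [2, 3, 5, 7]))
```

The merge pass allocates no intermediate sets, which is where a set-based version spends most of its time.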
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #28206 from zhengruifeng/minhash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects(`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #28223 from hvanhovell/SPARK-31450.
Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1, remove newly added method: `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]`, because: 1) it is only used in `ChiSqSelector`; 2) since the returned dataframe of `def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame` only contains one row, after collecting it back to the driver, the results are similar;
2, add method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame` to return the flattened results;
### Why are the changes needed?
1, after getting the result dataframe, we may want to filter it, e.g. by `pValue<0.1`, but the currently returned dataframe is hard to use;
2, the current impl needs to collect all test results of all columns back to the driver, which is a bottleneck; if we return the flattened dataframe, we no longer need to collect them to the driver;
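The difference between the two result shapes can be sketched with plain Python data. The field names and values below are assumptions chosen for illustration, not a guaranteed schema.

```python
def flatten_test_result(p_values, degrees_of_freedom, statistics):
    """Turn one row of per-feature arrays into one record per feature."""
    return [
        {"featureIndex": i, "pValue": p, "degreesOfFreedom": d, "statistic": s}
        for i, (p, d, s) in enumerate(zip(p_values, degrees_of_freedom, statistics))
    ]


# Single-row shape: one array per column, one entry per feature (made-up numbers).
single_row = {
    "pValues": [0.68, 0.04],
    "degreesOfFreedom": [2, 3],
    "statistics": [0.75, 8.1],
}
flat = flatten_test_result(
    single_row["pValues"], single_row["degreesOfFreedom"], single_row["statistics"]
)
# With one record per feature, a filter such as pValue < 0.1 is a plain row filter.
significant = [r for r in flat if r["pValue"] < 0.1]
print(significant)
```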
### Does this PR introduce any user-facing change?
Yes:
1, `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]` removed;
2, the returned dataframe needs an action to trigger computation;
### How was this patch tested?
updated testsuites
Closes #28176 from zhengruifeng/flatten_chisq.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `FMRegressor`:
- Supporting `org.apache.spark.ml.r.FMRegressorWrapper`.
- `FMRegressionModel` S4 class.
- Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes #27571 from zero323/SPARK-30819.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `LinearRegression`:
- Supporting `org.apache.spark.ml.r.LinearRegressionWrapper`.
- `LinearRegressionModel` S4 class.
- Corresponding `spark.lm`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes #27593 from zero323/SPARK-30818.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add a cleanShuffleDependencies as an experimental developer feature to allow folks to clean up shuffle files more aggressively than we currently do.
### Why are the changes needed?
Dynamic scaling on Kubernetes (introduced in Spark 3) depends on only shutting down executors without shuffle files. However Spark does not aggressively clean up shuffle files (see SPARK-5836) and instead depends on JVM GC on the driver to trigger deletes. We already have a mechanism to explicitly clean up shuffle files from the ALS algorithm where we create a lot of quickly orphaned shuffle files. We should expose this as an advanced developer feature to enable people to better clean-up shuffle files improving dynamic scaling of their jobs on Kubernetes.
### Does this PR introduce any user-facing change?
This adds a new experimental API.
### How was this patch tested?
ALS already used a mechanism like this, re-targets the ALS code to the new interface, tested with existing ALS tests.
Closes #28038 from holdenk/SPARK-31208-allow-users-to-cleanup-shuffle-files.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `FMClassifier`:
- Supporting `org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes #27570 from zero323/SPARK-30820.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
when input dataset is sparse, make `ANOVATest` only process non-zero value
### Why are the changes needed?
for performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27982 from zhengruifeng/anova_sparse.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a common method `computeChiSq` and reuse it in both `chiSquaredDenseFeatures` and `chiSquaredSparseFeatures`
### Why are the changes needed?
to simplify ChiSq
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #28045 from zhengruifeng/simplify_chisq.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-31223
set seed in np.random when generating test data......
### Why are the changes needed?
so the same set of test data can be regenerated later.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #27994 from huaxingao/spark-31223.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, remove unused var `numFeatures`;
2, remove the computation of `numSamples` and `numClasses`, since they can be directly inferred from `counts: OpenHashMap[Double, Long]`
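The inference from the per-label counts is straightforward. A sketch with a plain dict standing in for the `OpenHashMap` (the counts are made up and `infer_from_counts` is a hypothetical name):

```python
def infer_from_counts(counts):
    """Derive (numSamples, numClasses) from a label -> sample count mapping."""
    num_samples = sum(counts.values())
    num_classes = len(counts)
    return num_samples, num_classes


counts = {0.0: 3, 1.0: 5, 2.0: 2}  # label -> number of samples with that label
num_samples, num_classes = infer_from_counts(counts)
print(num_samples, num_classes)  # 10 3
```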
### Why are the changes needed?
remove an unnecessary job to compute `numSamples` and `numClasses`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27979 from zhengruifeng/anova_followup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
### Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.
scikit-learn has implemented this selector:
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold
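The selection rule can be sketched in plain Python. This is only an illustration, not the Spark implementation: the function name `select_by_variance` is hypothetical, the use of the sample variance is an assumption, and the strict comparison is chosen so that the default threshold of 0 drops constant features.

```python
from statistics import variance

def select_by_variance(rows, threshold=0.0):
    """Return the indices of features whose sample variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

rows = [
    [1.0, 0.0, 3.0],
    [1.0, 2.0, 3.5],
    [1.0, 4.0, 2.5],
]
# Feature 0 has the same value in all samples, so it is removed by default.
selected = select_by_variance(rows)
```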
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Add new test suite.
Closes #27954 from huaxingao/variance-threshold.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change BLAS for some level-1 routines (axpy, dot, scal(double, denseVector)) from the Java implementation to NativeBLAS when the vector size is greater than 256.
### Why are the changes needed?
In the current ML BLAS.scala, all level-1 routines are fixed to use the Java implementation. But NativeBLAS (Intel MKL, OpenBLAS) can bring up to an 11X performance improvement, based on a performance test that calls these methods directly. We should provide a way for users to take advantage of NativeBLAS for level-1 routines. Here we do it by switching these methods from f2jBLAS to NativeBLAS.
### Does this PR introduce any user-facing change?
Yes, the level-1 routines axpy, dot and scal will switch to NativeBLAS when the vector has more than nativeL1Threshold (fixed value 256) elements, and will fall back to f2jBLAS if native BLAS is not properly configured on the system.
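A rough sketch of the dispatch rule described here, written in Python for illustration only. The function `choose_backend` and the `native_available` flag are hypothetical, and treating exactly 256 elements as "small" is an assumption taken from the wording "more than":

```python
NATIVE_L1_THRESHOLD = 256  # fixed value named in this description

def choose_backend(size, native_available):
    """Pick which BLAS implementation a level-1 call would use."""
    # Small vectors stay on the Java (f2j) path, as does any call
    # made when native BLAS is not properly configured.
    if native_available and size > NATIVE_L1_THRESHOLD:
        return "native"
    return "f2j"

small = choose_backend(100, native_available=True)
large = choose_backend(1000, native_available=True)
fallback = choose_backend(1000, native_available=False)
```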
### How was this patch tested?
Perf test with direct calls to the level-1 routines
Closes #27546 from yma11/SPARK-30773.
Lead-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Ma Yan <yan.ma@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add ANOVA Selector
### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.
https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.
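For reference, the one-way ANOVA F-value that such a selector would rank features by can be sketched as follows. This is the textbook formula in plain Python, not the Spark code, and the function name `anova_f_value` is hypothetical:

```python
from collections import defaultdict

def anova_f_value(values, labels):
    """One-way ANOVA F statistic of a continuous feature against a categorical label."""
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[label].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_value = anova_f_value([1.0, 2.0, 3.0, 5.0, 6.0, 7.0], [0, 0, 0, 1, 1, 1])
```

A larger F-value means the feature separates the label classes more clearly.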
### Does this PR introduce any user-facing change?
Yes, add a new Selector.
### How was this patch tested?
add new test suites
Closes #27895 from huaxingao/anova.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
This PR solves the same issue as [pr27919](https://github.com/apache/spark/pull/27919), but this one changes the file names based on a comment from the previous PR.
### What changes were proposed in this pull request?
Make some file names the same as the class names in the R package.
### Why are the changes needed?
Make the file names consistent
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run `./R/run-tests.sh`
Closes #27940 from kevinyu98/spark-30954-r-v2.
Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
remove unused variables;
### Why are the changes needed?
remove unused variables;
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27922 from zhengruifeng/test_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930
Remove ML/MLLIB DeveloperApi annotations.
### Why are the changes needed?
The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.
### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.
### How was this patch tested?
existing tests
Closes #27859 from huaxingao/spark-30930.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, compute summary and update distributions in one pass;
2, remove logic related to check `shouldDistributeGaussians`
### Why are the changes needed?
In the current impl, GMM needs to trigger two jobs in each iteration:
1, one to compute summary;
2, if `shouldDistributeGaussians = ((k - 1.0) / k) * numFeatures > 25.0`, trigger another to update distributions;
`shouldDistributeGaussians` is almost always true in practice, since numFeatures is likely to be greater than 25.
We can use only one job to implement the above computation by following the logic in `KMeans`: using `reduceByKey` to compute statistics for each center
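The condition quoted above is easy to evaluate by hand. The following sketch just restates the formula in Python (the function name is hypothetical) to show which inputs used to trigger the second job:

```python
def should_distribute_gaussians(k, num_features):
    # Same expression as quoted in this description.
    return ((k - 1.0) / k) * num_features > 25.0

wide = should_distribute_gaussians(k=2, num_features=100)    # 50.0 > 25.0
border = should_distribute_gaussians(k=2, num_features=50)   # 25.0 is not > 25.0
narrow = should_distribute_gaussians(k=10, num_features=26)  # 23.4 is not > 25.0
```

So the flag was only false for fairly narrow datasets, which is why merging the two jobs into one pass covers the common case.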
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27784 from zhengruifeng/gmm_avoid_distri_gaussian.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
```ChiSqSelector``` depends on ```mllib.ChiSqSelectorModel``` to do the selection logic. This PR removes that dependency.
### Why are the changes needed?
This PR is an intermediate PR. It removes the ```ChiSqSelector``` dependency on ```mllib.ChiSqSelectorModel```. The next subtask will extract the common code between ```ChiSqSelector``` and ```FValueSelector``` and put it in an abstract ```Selector```.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New and existing tests
Closes #27841 from huaxingao/chisq.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Audit the new ML Scala APIs introduced in 3.0 and fix the issues found.
### Why are the changes needed?
### Does this PR introduce any user-facing change?
Yes. Some doc changes
### How was this patch tested?
Existing tests
Closes #27818 from huaxingao/spark-30929.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Updating ML docs for 3.0 changes
### Why are the changes needed?
While auditing the 3.0 ML changes, I found that some docs are missing or not updated. These need to be updated.
### Does this PR introduce any user-facing change?
Yes, doc changes
### How was this patch tested?
Manually build and check
Closes #27762 from huaxingao/spark-doc.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add FValueRegressionSelector for continuous features and continuous labels.
### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.
This PR adds FValueSelector for continuous features and continuous labels.
ANOVASelector for continuous features and categorical labels will be added later using a separate PR.
### Does this PR introduce any user-facing change?
Yes.
Add a new Selector
### How was this patch tested?
Add new tests
Closes #27679 from huaxingao/spark_30776.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Current impl needs to convert ml.Vector to breeze.Vector, which can be skipped.
### Why are the changes needed?
avoid unnecessary vector conversions
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27519 from zhengruifeng/gmm_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Fix mistakes in comments
### Why are the changes needed?
There are mistakes in comments
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #27564 from xwu99/fix-mllib-sprand-comment.
Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, avoid `Iterator.grouped(size: Int)`, which needs to maintain an array buffer of `size` entries
2, keep the number of partitions in curve computation
### Why are the changes needed?
1, `BinaryClassificationMetrics` tends to fail (OOM) when `grouping=count/numBins` is too large, because `Iterator.grouped(size: Int)` needs to maintain an array buffer with `size` entries; however, in `BinaryClassificationMetrics` we do not need to maintain such a big array;
2, make the sizes of partitions more even;
This PR makes the metrics computation more stable and a little faster;
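The idea of aggregating each group on the fly, instead of buffering `size` entries first, can be sketched with a Python generator. This is an illustration only: the function name `downsample` is hypothetical, and the choice to keep the last score and the summed count of each group is an assumption about what the metrics need.

```python
def downsample(pairs, grouping):
    """Yield (last_score, summed_count) for each run of `grouping` items.

    Only three scalars are kept between items, so memory use does not
    grow with `grouping`.
    """
    seen, total, last_score = 0, 0, None
    for score, count in pairs:
        last_score = score
        total += count
        seen += 1
        if seen == grouping:
            yield last_score, total
            seen, total = 0, 0
    if seen > 0:  # emit the final, possibly shorter, group
        yield last_score, total

pairs = [(0.9, 1), (0.8, 2), (0.7, 3), (0.6, 4), (0.5, 5)]
result = list(downsample(pairs, grouping=2))
```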
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27682 from zhengruifeng/grouped_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When VectorAssembler encounters a NULL with handleInvalid="error", it throws an exception. This exception, though, has a typo making it confusing. Yet apparently, this same exception for NaN values is fine. Fixed it to look like the right one.
### Why are the changes needed?
Encountering this error with such a message was very confusing! I hope to save fellow engineers some time by improving it.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
It's just an error message...
Closes #27709 from Saluev/patch-1.
Authored-by: Tigran Saluev <tigran@saluev.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.
### Why are the changes needed?
N/A
### Does this PR introduce any user-facing change?
N/A
### How was this patch tested?
N/A
Closes #27698 from gatorsmile/updateVersion.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, remove unused imports and variables;
2, use `.iterator` instead of `.view` to avoid IDEA warnings;
3, remove resolved _TODO_
### Why are the changes needed?
cleanup
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27600 from zhengruifeng/nits.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Set the supplied output col name as intended when StringIndexer transforms an input after setOutputCols is used.
### Why are the changes needed?
The output col names are wrong otherwise and downstream pipeline components fail.
### Does this PR introduce any user-facing change?
Yes in the sense that it fixes incorrect behavior, otherwise no.
### How was this patch tested?
Existing tests plus new direct tests of the schema.
Closes #27684 from srowen/SPARK-30939.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This is the very first PR for supporting continuous distribution features selectors.
It adds the algorithm to compute fvalue for continuous features and continuous labels. This algorithm will be used for FValueRegressionSelector.
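As an illustration of the kind of computation involved, the sketch below derives an F-value from the Pearson correlation between one feature and the label, using the common formula F = r² / (1 - r²) * (n - 2). That this matches the algorithm in the patch is an assumption, and the function name is hypothetical:

```python
def f_value_regression(xs, ys):
    """F statistic for a single continuous feature against a continuous label."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    degrees_of_freedom = n - 2
    return r_squared / (1.0 - r_squared) * degrees_of_freedom

f_value = f_value_regression([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

The sketch does not handle a perfectly correlated or constant feature, where the denominator becomes zero.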
### Why are the changes needed?
Currently Spark only supports the selection of categorical features, while there are many requirements for the selection of continuous distribution features.
I will add two new selectors:
1. FValueRegressionSelector for continuous features and continuous labels.
2. ANOVAFValueClassificationSelector for continuous features and categorical labels.
I will use subtasks to add these two selectors:
- add FValueRegressionSelector on the Scala side
  - add FValueRegressionTest, which contains the algorithm to compute the F-value
  - add FValueRegressionSelector using the above algorithm
  - add a common Selector, and make FValueRegressionSelector and ChiSqSelector extend it
- add FValueRegressionSelector on the Python side
- add samples and docs
- do the same for ANOVAFValueClassificationSelector
### Does this PR introduce any user-facing change?
Yes.
```
/**
 * @param dataset DataFrame of continuous labels and continuous features.
 * @param featuresCol Name of features column in dataset, of type `Vector` (`VectorUDT`)
 * @param labelCol Name of label column in dataset, of any numerical type
 * @return Array containing the SelectionTestResult for every feature against the label.
 */
SelectionTest.fValueRegressionTest(dataset: Dataset[_], featuresCol: String, labelCol: String)
```
### How was this patch tested?
Add Unit test.
Closes #27623 from huaxingao/spark-30867.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).
Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.
### Why are the changes needed?
According to #23498, since Spark 3.0, the untyped UDF will return the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.
As a result, this might change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we should encourage users to use typed UDFs to avoid this problem.
### Does this PR introduce any user-facing change?
Yes. Users will now hit an exception when they use an untyped UDF.
### How was this patch tested?
Added test and updated some tests.
Closes #27488 from Ngone51/spark_26580_followup.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are three changes in this PR:
1. use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suites (similar to https://github.com/apache/spark/pull/26396)
2. Put common code in ```Summarizer.getRegressionSummarizers``` and ```Summarizer.getClassificationSummarizers```.
3. Move ```MultiClassSummarizer``` from ```LogisticRegression``` to ```ml.stat``` (this seems to be a better place since ```MultiClassSummarizer``` is not only used by ```LogisticRegression``` but also several other classes).
### Why are the changes needed?
Minimize code duplication and improve performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing test suites.
Closes #27555 from huaxingao/spark-30802.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, gather the `contingency` matrix of each feature in a distributed way
2, compute the results in a distributed way and then collect them back to the driver
### Why are the changes needed?
the existing impl is not efficient:
1, it directly collects the `contingency` matrices of a subset of features to the driver and computes the corresponding results in one pass;
2, the `contingency` matrix of a feature is of size numDistinctValues X numDistinctLabels, so only 1000 matrices can be collected at a time;
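To make the size argument concrete, here is a small single-machine sketch of how a `contingency` matrix for one feature is built and turned into a chi-squared statistic. It uses the textbook Pearson formula in plain Python; the function names are hypothetical and nothing here is distributed:

```python
from collections import Counter

def contingency_matrix(feature_values, labels):
    """Rows are distinct feature values, columns are distinct labels."""
    counts = Counter(zip(feature_values, labels))
    distinct_values = sorted(set(feature_values))
    distinct_labels = sorted(set(labels))
    return [[counts[(v, l)] for l in distinct_labels] for v in distinct_values]

def chi_square_statistic(contingency):
    row_sums = [sum(row) for row in contingency]
    col_sums = [sum(col) for col in zip(*contingency)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(contingency):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

feature = [0] * 30 + [1] * 30
label = [0] * 10 + [1] * 20 + [0] * 20 + [1] * 10
matrix = contingency_matrix(feature, label)
statistic = chi_square_statistic(matrix)
```

Each feature needs its own matrix, which is why computing the statistic where the matrix is built avoids shipping every matrix to the driver.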
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27461 from zhengruifeng/chisq_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
It is said in [LeastSquaresAggregator](12e1bbaddb/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala (L188)) that:
> // do not use tuple assignment above because it will circumvent the transient tag
I then checked this issue with Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
### Why are the changes needed?
avoid tuple assignment because it will circumvent the transient tag
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27523 from zhengruifeng/avoid_tuple_assign_to_transient.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, we add a parameter to the Python function vector_to_array(col) that allows converting to a column of arrays of Float (32 bits) in Scala, which would be mapped to a numpy array of dtype=float32.
### Why are the changes needed?
In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in the memory.
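The memory argument can be checked with the standard-library `array` module, used here only as a stand-in for the numpy arrays mentioned above: a 32-bit float takes half the bytes of a 64-bit one.

```python
from array import array

values = [0.5, 1.5, 2.5, 3.5]
as_float64 = array("d", values)  # 8 bytes per element
as_float32 = array("f", values)  # 4 bytes per element

bytes_float64 = as_float64.itemsize * len(as_float64)
bytes_float32 = as_float32.itemsize * len(as_float32)
```

The saving comes at the cost of precision, so values that are not exactly representable in 32 bits will be rounded.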
### Does this PR introduce any user-facing change?
Yes.
Old: `vector_to_array()` takes only one param
```
df.select(vector_to_array("colA"), ...)
```
New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64")
```
df.select(vector_to_array("colA", "float32"), ...)
```
### How was this patch tested?
Unit test in scala.
doctest in python.
Closes #27522 from liangz1/udf-float32.
Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize``` in ALS, so users can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Revert
#27360 #27396 #27374 #27389
### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
A performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure there is no regression.
### Does this PR introduce any user-facing change?
remove newly added param blockSize
### How was this patch tested?
reverted testsuites
Closes #27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
var `negThetaSum` is always used together with `pi`, so we can add them together up front
### Why are the changes needed?
only one var `piMinusThetaSum` is needed, instead of `pi` and `negThetaSum`
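A sketch of the idea in Python, assuming a Bernoulli naive Bayes model with log-probabilities `pi` (per class) and `theta` (per class and feature). The function names are hypothetical and the model assumption is not stated in this description:

```python
from math import exp, log

def precompute(pi, theta):
    """Fold the per-class constants together once, before any prediction."""
    neg_theta = [[log(1.0 - exp(t)) for t in row] for row in theta]
    theta_minus_neg_theta = [
        [t - nt for t, nt in zip(t_row, nt_row)]
        for t_row, nt_row in zip(theta, neg_theta)
    ]
    # pi[c] + negThetaSum[c] stored as one vector (named piMinusThetaSum above).
    pi_minus_theta_sum = [p + sum(nt_row) for p, nt_row in zip(pi, neg_theta)]
    return theta_minus_neg_theta, pi_minus_theta_sum

def raw_prediction(x, theta_minus_neg_theta, pi_minus_theta_sum):
    # One dot product plus one precomputed constant per class.
    return [
        bias + sum(w * xi for w, xi in zip(row, x))
        for row, bias in zip(theta_minus_neg_theta, pi_minus_theta_sum)
    ]

pi = [log(0.4), log(0.6)]
theta = [[log(0.2), log(0.7)], [log(0.5), log(0.1)]]
weights, bias = precompute(pi, theta)
scores = raw_prediction([1.0, 0.0], weights, bias)
```

Prediction then adds a single precomputed vector instead of two.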
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27427 from zhengruifeng/nb_predict.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, use blocks instead of vectors for performance improvement
2, use Level-2 BLAS
3, move standardization of input vectors outside of gradient computation
### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (30% ~ 102%)
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27396 from zhengruifeng/blockify_lireg.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Currently, MLP has its own ```blockSize``` param. We should make MLP extend ```HasBlockSize```, since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.
ALS doesn't have a ```blockSize``` param now; we can make it extend ```HasBlockSize``` so users can specify the ```blockSize```.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27389 from huaxingao/spark-30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace deprecated method `computeCost` of `BisectingKMeansModel` by `summary.trainingCost`.
### Why are the changes needed?
The changes eliminate deprecation warnings:
```
BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] assert(model.computeCost(dataset) < 0.1)
BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] assert(model.computeCost(dataset) == summary.trainingCost)
BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] model.computeCost(dataset)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `BisectingKMeansSuite` via:
```
./build/sbt "test:testOnly *BisectingKMeansSuite"
```
Closes #27401 from MaxGekk/kmeans-computeCost-warning.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, use blocks instead of vectors
2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial
### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (40% ~ 92%)
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27374 from zhengruifeng/blockify_lor.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param `blockSize`;
3, add a new class `InstanceBlock`
4, standardize the input outside of optimization procedure;
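The stacking step can be pictured with a small Python sketch. It is illustrative only: `blockify` and `block_margins` are hypothetical names, and the nested lists stand in for the dense matrices that a real implementation would hand to a Level-2 BLAS routine.

```python
def blockify(vectors, block_size):
    """Stack consecutive input vectors into blocks of at most `block_size` rows."""
    return [vectors[i:i + block_size] for i in range(0, len(vectors), block_size)]

def block_margins(block, coefficients):
    # One matrix-vector product per block instead of one dot product per vector.
    return [sum(v * c for v, c in zip(row, coefficients)) for row in block]

vectors = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.5], [2.0, 2.0], [1.0, 1.0]]
coefficients = [0.5, -1.0]
blocks = blockify(vectors, block_size=2)
margins = [m for block in blocks for m in block_margins(block, coefficients)]
```

The results are identical to the per-vector computation; only the grouping of the work changes.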
### Why are the changes needed?
1, reduce RAM to persist training dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS)
### Does this PR introduce any user-facing change?
a new param `blockSize`
### How was this patch tested?
existing and updated testsuites
Closes #27360 from zhengruifeng/blockify_svc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove ```numTrees``` in GBT in 3.0.0.
### Why are the changes needed?
Currently, GBT has
```
/**
* Number of trees in ensemble
*/
@Since("2.0.0")
val getNumTrees: Int = trees.length
```
and
```
/** Number of trees in ensemble */
val numTrees: Int = trees.length
```
I think we should remove one of them. We deprecated it in 2.4.5 via https://github.com/apache/spark/pull/27352.
### Does this PR introduce any user-facing change?
Yes, remove ```numTrees``` in GBT in 3.0.0
### How was this patch tested?
existing tests
Closes #27330 from huaxingao/spark-numTrees.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.
### Why are the changes needed?
Currently, RF with numTrees=1 directly builds a tree using the original dataset,
while with numTrees>1 it uses bootstrap samples to build trees.
This design is for training a DecisionTreeModel with the RandomForest implementation; however, it is somewhat strange.
In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.
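What such a flag controls can be sketched in a few lines of Python. This is a simplification under stated assumptions: `training_sample` is a hypothetical helper, and drawing with `rng.choice` is plain sampling with replacement, which may differ from how Spark actually builds its bagged samples.

```python
import random

def training_sample(data, bootstrap, rng):
    """Return the rows a single tree would be trained on."""
    if bootstrap:
        # Draw len(data) rows with replacement; some rows repeat, some are left out.
        return [rng.choice(data) for _ in range(len(data))]
    # Without bootstrap, every tree sees the original dataset unchanged.
    return list(data)

data = [10, 20, 30, 40, 50]
rng = random.Random(7)
plain = training_sample(data, bootstrap=False, rng=rng)
bagged = training_sample(data, bootstrap=True, rng=rng)
```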
### Does this PR introduce any user-facing change?
Yes, new param is added
### How was this patch tested?
existing testsuites
Closes #27254 from zhengruifeng/add_bootstrap.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
unpersist graph outside checkpointer, like what Pregel does
### Why are the changes needed?
Shown in [SPARK-30503](https://issues.apache.org/jira/browse/SPARK-30503), intermediate edges are not unpersisted
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites and manual test
Closes #27261 from zhengruifeng/lda_checkpointer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test
### Why are the changes needed?
In ```OneVsRest```, if the classifier doesn't support instance weight, the ```OneVsRest``` weightCol will be ignored, so the unit test covers one classifier (```LogisticRegression```) that supports instance weight and one classifier (```DecisionTreeClassifier```) that doesn't. Since ```DecisionTreeClassifier``` now supports instance weight, we need to replace it with a classifier that doesn't have weight support.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test
Closes #27204 from huaxingao/spark-ovr-minor.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add setInputCol/setOutputCol in OHEModel
### Why are the changes needed?
setInputCol/setOutputCol should be in OHEModel too.
### Does this PR introduce any user-facing change?
Yes.
```OHEModel.setInputCol```
```OHEModel.setOutputCol```
### How was this patch tested?
Manually tested.
Closes #27228 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add field `storageLevel` in `PeriodicRDDCheckpointer`
2, for ml.GBT/ml.RF set storageLevel=`StorageLevel.MEMORY_AND_DISK`
### Why are the changes needed?
Intermediate RDDs in ML are cached with storageLevel=StorageLevel.MEMORY_AND_DISK.
PeriodicRDDCheckpointer & PeriodicGraphCheckpointer now store RDDs with storageLevel=StorageLevel.MEMORY_ONLY; it may be nice to be able to set the storageLevel of the checkpointer.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27189 from zhengruifeng/checkpointer_storage.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights
2, make RF support weights
### Why are the changes needed?
`weightCol` is already exposed, while RF does not support weights yet.
### Does this PR introduce any user-facing change?
Yes, new setters
### How was this patch tested?
added testsuites
Closes #27097 from zhengruifeng/rf_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix all the tests that fail when AQE is enabled.
### Why are the changes needed?
Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests
Closes #26813 from JkSelf/enableAQEDefault.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
add weight support in BisectingKMeans
### Why are the changes needed?
BisectingKMeans should support instance weighting
### Does this PR introduce any user-facing change?
Yes. BisectingKMeans.setWeight
### How was this patch tested?
Unit test
Closes #27035 from huaxingao/spark_30351.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make Regressors extend abstract class Regressor:
```AFTSurvivalRegression extends Estimator => extends Regressor```
```DecisionTreeRegressor extends Predictor => extends Regressor```
```FMRegressor extends Predictor => extends Regressor```
```GBTRegressor extends Predictor => extends Regressor```
```RandomForestRegressor extends Predictor => extends Regressor```
We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType.
### Why are the changes needed?
Make class hierarchy consistent for all Regressors
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #27168 from huaxingao/spark-30377.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, del `NodeIdCache`, and use `PeriodicRDDCheckpointer` instead;
2, reuse broadcasted `Splits` in the whole training;
### Why are the changes needed?
1, The functionality of `NodeIdCache` and `PeriodicRDDCheckpointer` is highly similar, and the update process of nodeIds is simple; one goal of "Generalize PeriodicGraphCheckpointer for RDDs" in SPARK-5561 is to use the checkpointer in RandomForest;
2, only need to broadcast `Splits` once;
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing testsuites
Closes #27145 from zhengruifeng/del_NodeIdCache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, for primitive types `Array.fill(n)(0)` -> `Array.ofDim(n)`;
2, for `AnyRef` types `Array.fill(n)(null)` -> `Array.ofDim(n)`;
3, for primitive types `Array.empty[XXX]` -> `Array.emptyXXXArray`
### Why are the changes needed?
`Array.ofDim` avoids assignments;
`Array.emptyXXXArray` avoids creating a new object;
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27133 from zhengruifeng/minor_fill_ofDim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make GBT reuse splits/treePoints for all trees:
1, reuse splits/treePoints for all trees:
the existing impl finds feature splits and transforms input vectors to treePoints for each tree, while other well-known impls like XGBoost/LightGBM build global splits/binned features once and reuse them for all trees;
Note: the sampling rate the existing impl uses to build `splits` is not the param `subsamplingRate` but the output of `RandomForest.samplesFractionForFindSplits`, which depends on `maxBins` and `numExamples`.
Note II: the existing impl does not guarantee that splits are the same across iterations, so this may cause a small difference in convergence.
2, do not cache input vectors:
the existing impl caches the input twice: 1, `input: RDD[Instance]` is used to compute/update predictions and errors; 2, at each iteration, the input is transformed to bagged points, which are cached during that iteration;
In this PR, `input: RDD[Instance]` is no longer cached, since it is only used three times: 1, to compute metadata; 2, to find splits; 3, to be converted to treePoints;
Instead, the treePoints `RDD[TreePoint]` is cached; at each iteration it is converted to bagged points by attaching an extra `labelWithCounts: RDD[(Double, Int)]` containing residuals/sampleCount information; this RDD is relatively small (like the cached `norms` in KMeans);
To compute/update predictions and errors, a new prediction method based on binned features is added in `Node`
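The idea of binning once and then making split decisions on binned features can be sketched in plain Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, the split-finding rule is a toy one, and none of it mirrors the actual `RandomForest`/`Node` code.

```python
from bisect import bisect_left


def find_splits(values, max_bins):
    """Choose at most max_bins - 1 sorted thresholds for one feature (toy rule)."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: split halfway between neighbours.
        return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    step = len(distinct) / max_bins
    return [distinct[int(step * i)] for i in range(1, max_bins)]


def bin_index(value, splits):
    """Bin = number of thresholds strictly below the value."""
    return bisect_left(splits, value)


def bin_rows(rows, splits_per_feature):
    """Done once; the binned rows are then reused for every tree."""
    return [[bin_index(v, s) for v, s in zip(row, splits_per_feature)]
            for row in rows]


def goes_left_raw(row, feature, split_idx, splits_per_feature):
    return row[feature] <= splits_per_feature[feature][split_idx]


def goes_left_binned(binned_row, feature, split_idx):
    # Same decision as goes_left_raw, but without touching the raw vector.
    return binned_row[feature] <= split_idx


rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
splits = [find_splits([r[f] for r in rows], max_bins=8) for f in range(2)]
binned = bin_rows(rows, splits)
```

Because the bin index counts the thresholds strictly below a value, comparing the bin index with the split index gives the same left/right decision as comparing the raw value with the threshold, which is why the raw vectors do not need to stay cached.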
### Why are the changes needed?
for performance improvement:
1, 40%~50% faster than the existing impl
2, saves 30%~50% RAM
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites & several manual tests in REPL
Closes #27103 from zhengruifeng/gbt_reuse_bagged.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col
df.select(vector_to_array(col("features")))
```
### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, this requires the PySpark user to write Scala code and register it as a UDF, which is often infeasible for a pure Python project.
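Conceptually, the conversion expands a sparse vector's size, indices and values into a flat array. A minimal pure-Python sketch of that expansion follows; the helper name is hypothetical, and this is not the JVM implementation behind `vector_to_array`.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


dense = sparse_to_dense(4, [1, 3], [2.0, 5.0])
```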
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes #26910 from WeichenXu123/vector_to_array.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
1, fix `BaggedPoint.convertToBaggedRDD` when `subsamplingRate < 1.0`
2, reorg `RandomForest.runWithMetadata` btw
### Why are the changes needed?
In GBT, Instance weights will be discarded if subsamplingRate<1
1, `baggedPoint: BaggedPoint[TreePoint]` is used in the tree growth to find best split;
2, `BaggedPoint[TreePoint]` contains two weights:
```scala
class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
```
3, only the var `sampleWeight` in `BaggedPoint` is used, the var `weight` in `TreePoint` is never used in finding splits;
4, The method `BaggedPoint.convertToBaggedRDD` was changed in https://github.com/apache/spark/pull/21632, it was only for decisiontree, so only the following code path was changed;
```
if (numSubsamples == 1 && subsamplingRate == 1.0) {
  convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
}
```
5, In https://github.com/apache/spark/pull/25926, I made GBT support weights, but only tested it with the default `subsamplingRate==1`.
GBT with `subsamplingRate<1` converts treePoints to baggedPoints via
```scala
convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
```
in which the original weights from `weightCol` are discarded and every `sampleWeight` is assigned the default 1.0;
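The effect can be sketched in plain Python: sampling without replacement that also takes a weight extractor keeps each instance weight, while falling back to the default extractor reproduces the "all weights are 1.0" behaviour. The names and signature below are hypothetical and only loosely modelled on `BaggedPoint`; the sampling rule is a simple per-point coin flip.

```python
import random
from dataclasses import dataclass
from typing import Any, List


@dataclass
class BaggedPoint:
    datum: Any
    subsample_counts: List[int]
    sample_weight: float = 1.0


def bag_without_replacement(data, subsampling_rate, num_subsamples, seed,
                            extract_sample_weight=lambda datum: 1.0):
    """Each point is kept in each subsample with probability subsampling_rate."""
    rng = random.Random(seed)
    bagged = []
    for datum in data:
        counts = [1 if rng.random() < subsampling_rate else 0
                  for _ in range(num_subsamples)]
        # Carrying the weight through here is the point of the fix.
        bagged.append(BaggedPoint(datum, counts, extract_sample_weight(datum)))
    return bagged


points = [("a", 2.0), ("b", 0.5)]  # (label, instance weight)
weighted = bag_without_replacement(points, 0.5, 1, 42, lambda d: d[1])
unweighted = bag_without_replacement(points, 0.5, 1, 42)
```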
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
updated testsuites
Closes #27070 from zhengruifeng/gbt_sampling.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
make FMClassifier/Regressor call super class method extractLabeledPoints
### Why are the changes needed?
code reuse
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #27093 from huaxingao/spark-FM.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
use `.ml.Summarizer` instead of `.mllib.MultivariateOnlineSummarizer` to avoid computation of unused metrics
### Why are the changes needed?
to avoid computation of unused metrics
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27059 from zhengruifeng/pac_summarizer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Check before caching zippedData (as suggested in https://github.com/apache/spark/pull/26483#issuecomment-569702482).
### Why are the changes needed?
If the `data` is already cached before calling `run` method of `KMeans` then `zippedData.persist()` will hurt the performance. Hence, persisting it conditionally.
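A minimal sketch of conditional persistence, using a toy stand-in for an RDD. `ToyDataset` and `run` are hypothetical names for illustration; how Spark itself detects that the input is cached is not shown here.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: it only tracks whether it is cached."""

    def __init__(self, rows, cached=False):
        self.rows = list(rows)
        self.cached = cached

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self

    def map(self, f):
        # A derived dataset starts out uncached, like a transformed RDD.
        return ToyDataset([f(r) for r in self.rows])


def run(data, algorithm):
    """Persist the derived dataset only when the caller has not cached the input."""
    zipped = data.map(lambda row: (row, sum(x * x for x in row) ** 0.5))
    handle_persistence = not data.cached
    if handle_persistence:
        zipped.persist()
    try:
        return algorithm(zipped)
    finally:
        if handle_persistence:
            zipped.unpersist()


seen = []
run(ToyDataset([[3.0, 4.0]]), lambda z: seen.append(z.cached))
run(ToyDataset([[3.0, 4.0]], cached=True), lambda z: seen.append(z.cached))
```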
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually.
Closes #27052 from amanomer/29823followup.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```
### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap```
### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now
### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.
Closes #26838 from huaxingao/spark-30144.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, add new foreach-like methods: foreach/foreachNonZero
2, add iterator: iterator/activeIterator/nonZeroIterator
### Why are the changes needed?
see the [ticket](https://issues.apache.org/jira/browse/SPARK-30329) for details
foreach/foreachNonZero: for both convenience and performance (SparseVector.foreach should be faster than the current traversal method)
iterator/activeIterator/nonZeroIterator: add the three iterators so that, in the future, we can add/change some impls based on them on both the ml and mllib sides and avoid vector conversions.
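The difference between the three iterators can be illustrated with a toy sparse vector in Python. The sketch assumes that `iterator` visits every index, `activeIterator` visits only stored entries (including explicitly stored zeros), and `nonZeroIterator` skips stored zeros; the class below is hypothetical and assumes `indices` is sorted in ascending order.

```python
class SparseVector:
    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = list(indices)
        self.values = [float(v) for v in values]

    def iterator(self):
        """Yield (index, value) for every index, filling gaps with 0.0."""
        pos = 0
        for i in range(self.size):
            if pos < len(self.indices) and self.indices[pos] == i:
                yield i, self.values[pos]
                pos += 1
            else:
                yield i, 0.0

    def active_iterator(self):
        """Yield only the stored entries, including explicit zeros."""
        return zip(self.indices, self.values)

    def non_zero_iterator(self):
        """Yield only the stored entries whose value is not zero."""
        return ((i, v) for i, v in self.active_iterator() if v != 0.0)

    def foreach_non_zero(self, f):
        for i, v in self.non_zero_iterator():
            f(i, v)


vec = SparseVector(5, [1, 3, 4], [2.0, 0.0, 7.0])
collected = []
vec.foreach_non_zero(lambda i, v: collected.append((i, v)))
```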
### Does this PR introduce any user-facing change?
Yes, new methods are added
### How was this patch tested?
added testsuites
Closes #26982 from zhengruifeng/vector_iter.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add `instr.logSumOfWeights` to the algorithms that support weightCol
### Why are the changes needed?
Many algorithms support weightCol now. I think weightsum is useful info to add to the log.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
manually tested
Closes #26972 from huaxingao/spark-30321.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Refactor `RandomForest.findSplits` by applying `aggregateByKey` instead of `groupByKey`
### Why are the changes needed?
The current impl of `RandomForest.findSplits` uses `groupByKey` to collect all non-zero values for each feature, which is risky because every value of a feature has to be gathered in one place.
After looking into the subsequent split-finding logic, I found that collecting all non-zero values is not necessary; we only need the weightSums of distinct values.
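A plain-Python sketch of the aggregation idea: rather than gathering every non-zero value per feature, keep one running weight sum per distinct value, so the state per feature grows with the number of distinct values rather than the number of rows. The function name and input layout are hypothetical; this is not the `aggregateByKey` code itself.

```python
from collections import defaultdict


def weight_sums_by_feature(rows):
    """rows: iterable of (features, weight), features being {index: value}."""
    acc = defaultdict(lambda: defaultdict(float))
    for features, weight in rows:
        for index, value in features.items():
            if value != 0.0:
                # Merge into the per-feature map instead of appending to a list.
                acc[index][value] += weight
    return {index: dict(sums) for index, sums in acc.items()}


rows = [
    ({0: 1.0, 1: 2.0}, 1.0),
    ({0: 1.0}, 0.5),
    ({1: 3.0}, 2.0),
]
sums = weight_sums_by_feature(rows)
```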
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27040 from zhengruifeng/rf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, expose predictRaw and predictProbability
2, add tests
### Why are the changes needed?
single-instance prediction is useful outside of Spark, especially for online prediction.
`predict` is currently exposed, but it is not enough.
### Does this PR introduce any user-facing change?
Yes, new methods are exposed
### How was this patch tested?
added testsuites
Closes #27015 from zhengruifeng/expose_raw_prob.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
using `MetadataUtils.getNumFeatures` to extract the numFeatures
### Why are the changes needed?
may avoid `first`/`head` job if metadata has attrGroup
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27037 from zhengruifeng/unify_numFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
compute count and quantile on one pass
### Why are the changes needed?
to avoid extra pass
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26990 from zhengruifeng/quantile_count_lsh.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, remove unused imports and variables
2, remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count
3, mark `clusterCentersWithNorm` in KMeansModel as transient and lazy, since it is only used in transformation and can be generated directly from the centers.
### Why are the changes needed?
1, remove unused code
2, avoid repeated computation
3, reduce broadcast size
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27014 from zhengruifeng/kmeans_clean_up.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
precompute the `DecisionTreeMetadata` and reuse it for all trees
### Why are the changes needed?
In existing impl, each `DecisionTreeRegressor` needs a pass on the whole dataset to calculate the same `DecisionTreeMetadata` repeatedly.
In this PR, with default depth=5, it is about 8% faster than the existing impl
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27011 from zhengruifeng/gbt_reuse_instr_meta.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added testsuites
Closes #26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
LibSVMDataSource attach AttributeGroup
### Why are the changes needed?
LibSVMDataSource will attach a special metadata to indicate numFeatures:
```scala
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
```
However, all ML impls will try to obtain vector size via AttributeGroup, which can not use this metadata:
```scala
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._
scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
added tests
Closes #27003 from zhengruifeng/libsvm_attr_group.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
compute the medians/ranges in a more distributed way
### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from executors,
since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`).
In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and the existing impl failed due to OOM.
After this PR, it successfully fits the model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26803 from zhengruifeng/robust_high_dim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1,modify the toString in SQLTransformer & VectorSizeHint
2,add toString in RegexTokenizer
### Why are the changes needed?
In SQLTransformer & VectorSizeHint, the `toString` methods directly call getters of params that have no default values.
This will cause `java.util.NoSuchElementException` in REPL:
```scala
scala> val vs = new VectorSizeHint()
java.util.NoSuchElementException: Failed to find a default value for size
at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:780)
at scala.Option.getOrElse(Option.scala:189)
```
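The failure mode and one way to avoid it can be sketched in Python: a string method that calls a strict getter fails when the param is unset, while one that first checks for a value does not. All class, method and param-store names here are hypothetical, and the output format is an assumption rather than what Spark prints.

```python
class Params:
    def __init__(self):
        self._values = {}
        self._defaults = {}

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        """Strict getter: fails when the param has neither a value nor a default."""
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"Failed to find a default value for {name}")

    def get_option(self, name):
        """Lenient getter: returns None when nothing is set."""
        if name in self._values:
            return self._values[name]
        return self._defaults.get(name)


class SizeHint(Params):
    def __init__(self, uid="sizeHint_0"):
        super().__init__()
        self.uid = uid

    def unsafe_str(self):
        # Mirrors the bug: the strict getter raises for an unset param.
        return f"SizeHint: uid={self.uid}, size={self.get_or_default('size')}"

    def __str__(self):
        size = self.get_option("size")
        suffix = "" if size is None else f", size={size}"
        return f"SizeHint: uid={self.uid}{suffix}"


hint = SizeHint()
```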
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26999 from zhengruifeng/fix_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in other directories
1. Find typo in `docs` and apply fixes to files in all directories
2. Fix `the the` -> `the`
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes #26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79
2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1
### Why are the changes needed?
Shouldn't change master.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master
Closes #26915 from wangyum/revert-master.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
- Replace `Seq[String]` by `Seq[_]` in `StopWordsRemoverSuite` because the `String` type is unchecked due to erasure.
- Throw an exception for default case in `MLTest.checkNominalOnDF` because we don't expect other attribute types currently.
- Explicitly cast float to double in `BigDecimal(y)`. This is what the `apply()` method does for `float`s.
- Replace deprecated `verifyZeroInteractions` by `verifyNoInteractions`.
- Equivalent replacement of `\0` by `\u0000` in `CSVExprUtilsSuite`
- Import `scala.language.implicitConversions` in `CollectionExpressionsSuite`, `HashExpressionsSuite` and in `ExpressionParserSuite`.
### Why are the changes needed?
The changes fix compiler warnings shown in the JIRA ticket https://issues.apache.org/jira/browse/SPARK-30170. Eliminating these warnings makes the remaining ones stand out, so real problems get more attention.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `StopWordsRemoverSuite`, `AnalysisExternalCatalogSuite`, `CSVExprUtilsSuite`, `CollectionExpressionsSuite`, `HashExpressionsSuite`, `ExpressionParserSuite` and sub-tests of `MLTest`.
Closes #26799 from MaxGekk/eliminate-warning-2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues.
### Why are the changes needed?
To cross-compile with Scala 2.13 later.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately.
Closes #26826 from srowen/SPARK-30195.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add weight support in KMeans
### Why are the changes needed?
KMeans should support weighting
### Does this PR introduce any user-facing change?
Yes. ```KMeans.setWeightCol```
### How was this patch tested?
Unit Tests
Closes #26739 from huaxingao/spark-29967.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.
### Why are the changes needed?
These both enable building on Scala 2.13.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes #26787 from srowen/SPARK-30158.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds normalization to word vectors when fitting dataset in Word2Vec.
### Why are the changes needed?
Running Word2Vec on some datasets, when numIterations is large, can produce word vectors with infinite values.
### Does this PR introduce any user-facing change?
Yes. After this patch, Word2Vec won't produce word vectors with infinite values.
### How was this patch tested?
Manually. This issue is not reproducible on every dataset. The dataset known to reproduce it is too large (925M) to upload.
```scala
import org.apache.spark.ml.feature.Word2Vec

// Assumes a spark-shell session, where `spark` and its implicits are in scope.
case class Sentences(name: String, words: Array[String])

val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```
Before:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-Infinity,-Infin...|
| Talent|[-Infinity,Infini...|
| Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
| Priests|[-1.3570403355926...|
| Religion:|[-6.7049072282803...|
| Bu|[5.05591774315586...|
| Totoro:|[-1.0539840178632...|
| Trouble,|[-3.5363592836003...|
| Hatter|[4.90413981352826...|
| '79|[7.50436471285412...|
| Vile|[-2.9147142985312...|
| 9/11|[-Infinity,Infini...|
| Santino|[1.30005911270850...|
| Motives|[-1.2538958306253...|
| '13|[-4.5040152427657...|
| Fierce|[Infinity,Infinit...|
| Stover|[-2.6326895394029...|
| 'It|[1.66574533864436...|
| Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```
After:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-0.0454501919448...|
| Talent|[-0.2657704949378...|
| Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
| Priests|[-0.0047509293071...|
| Religion:|[-0.0411605164408...|
| Bu|[0.11837736517190...|
| Totoro:|[0.05258282646536...|
| Trouble,|[0.09482011198997...|
| Hatter|[0.06040831282734...|
| '79|[0.04783720895648...|
| Vile|[-0.0017210749210...|
| 9/11|[-0.0713915303349...|
| Santino|[-0.0412711687386...|
| Motives|[-0.0492418706417...|
| '13|[-0.0073119504377...|
| Fierce|[-0.0565455369651...|
| Stover|[0.06938160210847...|
| 'It|[0.01117012929171...|
| Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```
Closes #26722 from viirya/SPARK-24666-2.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Removed unnecessary persist.
### Why are the changes needed?
The persist in `PythonMLLibAPI.scala` is unnecessary because `run()` of `gmmAlg` caches the data later.
710ddab39e/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (L167-L171)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes #26758 from amanomer/improperPersist.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.
### Why are the changes needed?
Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.
The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.
While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.
### Does this PR introduce any user-facing change?
Should not change behavior.
### How was this patch tested?
Existing tests.
Closes #26748 from srowen/SPARK-29392.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, `predictionCol` in `ml.classification` & `ml.clustering` add `NominalAttribute`
2, `rawPredictionCol` in `ml.classification` add `AttributeGroup` containing vectorsize=`numClasses`
3, `probabilityCol` in `ml.classification` & `ml.clustering` add `AttributeGroup` containing vectorsize=`numClasses`/`k`
4, `leafCol` in GBT/RF add `AttributeGroup` containing vectorsize=`numTrees`
5, `leafCol` in DecisionTree add `NominalAttribute`
6, `outputCol` in models in `ml.feature` add `AttributeGroup` containing vectorsize
7, `outputCol` in `UnaryTransformer`s in `ml.feature` add `AttributeGroup` containing vectorsize
### Why are the changes needed?
Appended metadata can be used in downstream ops, like `Classifier.getNumClasses`
There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` method.
However, there are also many impls that return no metadata in transformation, even though some metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred.
### Does this PR introduce any user-facing change?
Yes, some metadata is added to the transformed dataset.
### How was this patch tested?
existing testsuites and added testsuites
Closes #26547 from zhengruifeng/add_output_vecSize.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When PCA was first implemented in [SPARK-5521](https://issues.apache.org/jira/browse/SPARK-5521), Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so it was worked around by applying a sparse matrix multiplication.
Since [SPARK-7681](https://issues.apache.org/jira/browse/SPARK-7681), BLAS.gemv has supported sparse vectors, so we can directly use Matrix.multiply now.
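For intuition, a matrix-vector product with a sparse vector only needs to visit the stored entries of the vector. The plain-Python sketch below computes the transpose of a dense row-major matrix times a sparse vector; the function name is hypothetical, and it illustrates the operation rather than how `BLAS.gemv` is implemented.

```python
def transpose_multiply_sparse(matrix, size, indices, values):
    """Compute matrix^T * v for an n x k matrix and a sparse vector v of length n."""
    n = len(matrix)
    if n != size:
        raise ValueError("matrix row count must match the vector size")
    k = len(matrix[0]) if n else 0
    out = [0.0] * k
    for i, x in zip(indices, values):
        row = matrix[i]
        for j in range(k):
            # Only rows matching stored vector entries contribute.
            out[j] += row[j] * x
    return out


pc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 3 x 2 matrix
projected = transpose_multiply_sparse(pc, 3, [0, 2], [1.0, 2.0])
```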
### Why are the changes needed?
for simplicity
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26745 from zhengruifeng/pca_mul.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MNB/CNB/BNB use empty sigma matrix instead of null
### Why are the changes needed?
1,Using empty sigma matrix will simplify the impl
2,I am reviewing FM impl these days, FMModels have optional bias and linear part. It seems more reasonable to set optional part an empty vector/matrix or zero value than `null`
### Does this PR introduce any user-facing change?
yes, sigma from `null` to empty matrix
### How was this patch tested?
updated testsuites
Closes #26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Summarizer support more metrics: sum, std
### Why are the changes needed?
Those metrics are widely used; it is more convenient to obtain them directly than via a conversion.
in `NaiveBayes`: we want the sum of vectors, so mean & weightSum need to be computed and then multiplied
in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then take its square root to get std
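The two conversions that callers previously had to do by hand are simple arithmetic, sketched here in plain Python with hypothetical helper names: the sum is the mean scaled by the weight sum, and the standard deviation is the square root of the variance.

```python
import math


def sum_from_mean(mean, weight_sum):
    """sum_j = mean_j * weightSum"""
    return [m * weight_sum for m in mean]


def std_from_variance(variance):
    """std_j = sqrt(variance_j)"""
    return [math.sqrt(v) for v in variance]


total = sum_from_mean([1.5, 2.0], 4.0)
std = std_from_variance([4.0, 9.0])
```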
### Does this PR introduce any user-facing change?
yes, new metrics are exposed to end users
### How was this patch tested?
added testsuites
Closes #26596 from zhengruifeng/summarizer_add_metrics.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MulticlassClassificationEvaluator support hammingLoss
### Why are the changes needed?
1, it is easy to compute hammingLoss based on the confusion matrix
2, scikit-learn supports it
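For single-label multiclass predictions, the Hamming loss is the fraction of misclassified instances, which is the off-diagonal mass of the confusion matrix divided by the total. A minimal Python sketch, assuming an unweighted confusion matrix given as a list of rows (the function name is hypothetical):

```python
def hamming_loss(confusion):
    """confusion[i][j] = number of instances with label i predicted as j."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


loss = hamming_loss([[5, 1], [2, 2]])
```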
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
added testsuites
Closes #26597 from zhengruifeng/multi_class_hamming_loss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`
### Why are the changes needed?
1, it is a better choice for text classification: it is said in [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2, CNB is highly similar to the existing MNB; only a small part of the existing MNB needs to be changed, so it is an easy win to support CNB.
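To show how close CNB is to MNB, here is a small Python sketch of the unnormalized Complement Naive Bayes rule as described in the scikit-learn user guide: the weights for a class are estimated from the feature counts of all other classes, and the predicted class is the one whose complement fits the instance worst. The function names are hypothetical, weight normalization is omitted, and this is not Spark's implementation.

```python
import math


def fit_cnb(X, y, num_classes, alpha=1.0):
    """Return one weight vector per class, estimated from complement counts."""
    num_features = len(X[0])
    class_counts = [[0.0] * num_features for _ in range(num_classes)]
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            class_counts[label][j] += v
    total = [sum(class_counts[c][j] for c in range(num_classes))
             for j in range(num_features)]
    weights = []
    for c in range(num_classes):
        # Counts from every class except c, with additive smoothing.
        comp = [total[j] - class_counts[c][j] + alpha for j in range(num_features)]
        denom = sum(comp)
        weights.append([-math.log(v / denom) for v in comp])
    return weights


def predict_cnb(weights, row):
    scores = [sum(w * v for w, v in zip(ws, row)) for ws in weights]
    return max(range(len(scores)), key=scores.__getitem__)


X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = [0, 0, 1, 1]
model = fit_cnb(X, y, num_classes=2)
```

The only change relative to a multinomial model in this sketch is which counts feed the estimate (the complement of each class) and the sign of the log weights.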
### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported
### How was this patch tested?
added testsuites
Closes #26575 from zhengruifeng/cnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
A follow-up to rm useless test in VectorUDTSuite
### Why are the changes needed?
rm useless test, which is already covered.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes #26620 from yaooqinn/SPARK-29961-f.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add typeof function for Spark to get the underlying type of value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int
-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)
-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>
-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)
```
##### presto
```sql
presto> select typeof(array[1]);
_col0
----------------
array(integer)
(1 row)
```
##### PostgreSQL
```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
pg_typeof
-----------
numeric
numeric
numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597
### Why are the changes needed?
It is a useful function to have for debugging, testing, development, ...
### Does this PR introduce any user-facing change?
add a new function
### How was this patch tested?
add ut and example
Closes #26599 from yaooqinn/SPARK-29961.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use JUnit assertions in tests uniformly, not JVM assert() statements.
### Why are the changes needed?
assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
### Does this PR introduce any user-facing change?
No. The assertion logic should be identical.
### How was this patch tested?
Existing tests.
Closes #26581 from srowen/assertToJUnit.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
```LSHModel.approxNearestNeighbors``` sorts the full dataset on the hashDistance in order to find a threshold. This PR uses approxQuantile instead.
### Why are the changes needed?
To improve performance.
### Does this PR introduce any user-facing change?
Yes.
Changed ```LSH``` to make it extend ```HasRelativeError```
```LSH``` and ```LSHModel``` have new APIs ```setRelativeError/getRelativeError```
### How was this patch tested?
Existing tests. Also added a couple of doctests in Python to test the newly added ```getRelativeError```.
Closes #26415 from huaxingao/spark-18409.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
support `modelType` `gaussian`
### Why are the changes needed?
current modelTypes do not support continuous data
### Does this PR introduce any user-facing change?
yes, add a `modelType` option
### How was this patch tested?
existing testsuites and added ones
Closes #26413 from zhengruifeng/gnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover
### Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.
### Does this PR introduce any user-facing change?
```StopWordsRemover.setInputCols```
```StopWordsRemover.setOutputCols```
### How was this patch tested?
Unit tests
Closes #26480 from huaxingao/spark-29808.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust RDD to persist.
### Why are the changes needed?
To handle the improper persist strategy.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes #26483 from amanomer/SPARK-29823.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust improper unpersist timing on RDD.
### Why are the changes needed?
Improper unpersist timing will result in memory waste
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes #26469 from Icysandwich/SPARK-29844.
Authored-by: DongWang <cqwd123@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1, ML models should extend the toString method to expose basic information.
Currently some algs (GBT/RF/LoR) have done this, while others have not yet.
2, add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`
### Why are the changes needed?
ML models should extend toString method to expose basic information.
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
existing testsuites
Closes #26439 from zhengruifeng/models_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, unpersist intermediate rdd `wordCounts`
2, if the `dataset` is already persisted, we do not need to persist rdd `input`
3, if `minDF` and `maxDF` are both greater than or equal to 1, or both less than 1, we can compare and check them first.
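The early check in item 3 could look roughly like the following plain-Python sketch. It assumes the usual convention that values >= 1 are absolute document counts and values < 1 are fractions of the corpus; `check_df_bounds` is a hypothetical helper, not a function from this PR.

```python
def check_df_bounds(min_df, max_df):
    # Values >= 1 are treated as absolute document counts, values < 1 as
    # fractions of the corpus (assumed convention).
    both_counts = min_df >= 1.0 and max_df >= 1.0
    both_fractions = min_df < 1.0 and max_df < 1.0
    comparable = both_counts or both_fractions
    # When both bounds are on the same scale they can be compared up front,
    # before any data is scanned.
    if comparable and max_df < min_df:
        raise ValueError(f"maxDF ({max_df}) must be >= minDF ({min_df})")
    return comparable
```

When the two bounds are on different scales, the comparison has to wait until the number of documents is known.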
### Why are the changes needed?
we should unpersist unused rdd ASAP
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites
Closes #26398 from zhengruifeng/CountVectorizer_unpersist_wordCounts.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
use `ml.Summarizer` instead of `mllib.MultivariateOnlineSummarizer`
### Why are the changes needed?
1, I found that using `ml.Summarizer` is faster than the current impl;
2, `mllib.MultivariateOnlineSummarizer` maintains all arrays, while `ml.Summarizer` only maintains the necessary arrays;
3, using `ml.Summarizer` avoids vector conversions to `mllib.Vector`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26393 from zhengruifeng/maxabs_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR implements ```validateInputType``` in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
### Why are the changes needed?
```UnaryTransformer``` has abstract method ```validateInputType``` and calls it in ```transformSchema```, but this method is not implemented in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes #26388 from huaxingao/spark-29746.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1, change the scope of `ml.SummarizerBuffer` and add a method `createSummarizerBuffer` for it, so it can be used as an aggregator like `MultivariateOnlineSummarizer`;
2, In LoR/AFT/LiR/SVC, use Summarizer instead of MultivariateOnlineSummarizer
### Why are the changes needed?
The computation of the summary before the learning iterations is a bottleneck in high-dimension cases, since `MultivariateOnlineSummarizer` computes much more than needed.
The [ticket](https://issues.apache.org/jira/browse/SPARK-29754) has an example: with `--driver-memory=4G`, LoR always fails on the KDDA dataset. If we switch to `ml.Summarizer`, then `--driver-memory=3G` is enough to train a model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites & manual test in REPL
Closes #26396 from zhengruifeng/using_SummarizerBuffer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
expose expert param `aggregationDepth` in algs: GMM/GLR
### Why are the changes needed?
SVC/LoR/LiR/AFT already expose the expert param `aggregationDepth` to end users. It would be nice to expose it in similar algs.
### Does this PR introduce any user-facing change?
yes, expose new param
### How was this patch tested?
added Python tests
Closes #26322 from zhengruifeng/agg_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.
### Why are the changes needed?
The wrong order of assert parameters is confusing when the assert fails and the parameters have a special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes #26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
persist the input if needed
### Why are the changes needed?
training with non-cached dataset will hurt performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #26344 from zhengruifeng/linear_svc_cache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add shared param `relativeError`
2, `Imputer`/`RobustScaler`/`QuantileDiscretizer` extend `HasRelativeError`
### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` already has this param, while the other algs do not yet.
### Does this PR introduce any user-facing change?
yes, a new param is added in `Imputer`/`RobustScaler`
### How was this patch tested?
existing testsuites
Closes #26305 from zhengruifeng/add_relative_err.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after the 3.0.0-preview release has passed.
### Why are the changes needed?
To make the maven release repository accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
### What changes were proposed in this pull request?
add single-column input/output support in OneHotEncoder
### Why are the changes needed?
Currently, OneHotEncoder only has multi columns support. It makes sense to support single column as well.
### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```
### How was this patch tested?
Unit test
Closes #26265 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after the 3.0.0-preview release has passed.
### Why are the changes needed?
To make the maven release repository accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
add single-column input/output support in Imputer
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```
### How was this patch tested?
add unit tests
Closes #26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes #26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add weight support for GBTs by sampling data before passing it to trees and then passing weights to trees
in summary:
1, add setters of `minWeightFractionPerNode` & `weightCol`
2, update input types in private methods from `RDD[LabeledPoint]` to `RDD[Instance]`:
`DecisionTreeRegressor.train`, `GradientBoostedTrees.run`, `GradientBoostedTrees.runWithValidation`, `GradientBoostedTrees.computeInitialPredictionAndError`, `GradientBoostedTrees.computeError`,
`GradientBoostedTrees.evaluateEachIteration`, `GradientBoostedTrees.boost`, `GradientBoostedTrees.updatePredictionError`
3, add new private method `GradientBoostedTrees.computeError(data, predError)` to compute the average error, since the original `predError.values.mean()` does not take weights into account.
4, add new tests
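To see why a plain mean is not enough once instances carry weights (item 3), here is a small plain-Python sketch. The function names are hypothetical and the numbers are made up for illustration; this is not the Scala code from the PR.

```python
def unweighted_mean_error(errors):
    # What a plain mean computes: every instance counts the same.
    return sum(errors) / len(errors)


def weighted_mean_error(errors, weights):
    # Each instance's error is scaled by its weight, then normalized by
    # the total weight.
    total_weight = sum(weights)
    return sum(e * w for e, w in zip(errors, weights)) / total_weight


errors = [1.0, 3.0]
weights = [3.0, 1.0]  # the first instance is three times as important
plain = unweighted_mean_error(errors)
weighted = weighted_mean_error(errors, weights)
```

Here the plain mean is 2.0 while the weighted mean is 1.5, because the low-error instance carries most of the weight.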
### Why are the changes needed?
GBTs should support sample weights like other algs
### Does this PR introduce any user-facing change?
yes, new setters are added
### How was this patch tested?
existing & added testsuites
Closes #25926 from zhengruifeng/gbt_add_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` only contain the default parameter values. We need to update the parameter maps for these trees.
Same issues in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```
### Why are the changes needed?
Users want to access each individual tree and build the trees back up for the random forest estimator. This doesn't work because the trees don't have the correct parameter values.
### Does this PR introduce any user-facing change?
Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values.
### How was this patch tested?
Add tests
Closes #26154 from huaxingao/spark-29232.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss
### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
### Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param `eps`
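As a rough reference for what the metric computes, here is a minimal multiclass log-loss in plain Python. It assumes that `eps` clips probabilities away from 0 and 1, as sklearn does; the default value and the function name are assumptions for illustration, not taken from this PR.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    # labels: true class index per row.
    # probabilities: predicted class probabilities per row.
    total = 0.0
    for label, probs in zip(labels, probabilities):
        # Clip so that log(0) never occurs (assumed role of eps).
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(labels)
```

Without clipping, a single prediction that assigns probability 0 to the true class would make the metric infinite.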
### How was this patch tested?
added testsuites & local tests referring to sklearn
Closes #26135 from zhengruifeng/logloss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>