### What changes were proposed in this pull request?
- Replace `Seq[String]` by `Seq[_]` in `StopWordsRemoverSuite` because `String` type is unchecked due erasure.
- Throw an exception for default case in `MLTest.checkNominalOnDF` because we don't expect other attribute types currently.
- Explicitly cast float to double in `BigDecimal(y)`. This is what the `apply()` method does for `float`s.
- Replace deprecated `verifyZeroInteractions` by `verifyNoInteractions`.
- Equivalent replacement of `\0` by `\u0000` in `CSVExprUtilsSuite`
- Import `scala.language.implicitConversions` in `CollectionExpressionsSuite`, `HashExpressionsSuite` and in `ExpressionParserSuite`.
### Why are the changes needed?
The changes fix compiler warnings showed in the JIRA ticket https://issues.apache.org/jira/browse/SPARK-30170 . Eliminating the warning highlights other warnings which could take more attention to real problems.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `StopWordsRemoverSuite`, `AnalysisExternalCatalogSuite`, `CSVExprUtilsSuite`, `CollectionExpressionsSuite`, `HashExpressionsSuite`, `ExpressionParserSuite` and sub-tests of `MLTest`.
Closes#26799 from MaxGekk/eliminate-warning-2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues.
### Why are the changes needed?
To cross-compile with Scala 2.13 later.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately.
Closes#26826 from srowen/SPARK-30195.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add weight support in KMeans
### Why are the changes needed?
KMeans should support weighting
### Does this PR introduce any user-facing change?
Yes. ```KMeans.setWeightCol```
### How was this patch tested?
Unit Tests
Closes#26739 from huaxingao/spark-29967.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.
### Why are the changes needed?
These both enable building on Scala 2.13.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes#26787 from srowen/SPARK-30158.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds normalization to word vectors when fitting dataset in Word2Vec.
### Why are the changes needed?
Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.
### Does this PR introduce any user-facing change?
Yes. After this patch, Word2Vec won't produce infinity word vectors.
### How was this patch tested?
Manually. This issue is not always reproducible on any dataset. The dataset known to reproduce it is too large (925M) to upload.
```scala
case class Sentences(name: String, words: Array[String])
val dataset = spark.read
.option("header", "true").option("sep", "\t")
.option("quote", "").option("nullValue", "\\N")
.csv("/tmp/title.akas.tsv")
.filter("region = 'US' or language = 'en'")
.select("title")
.as[String]
.map(s => Sentences(s, s.split(' ')))
.persist()
println("Training model...")
val word2Vec = new Word2Vec()
.setInputCol("words")
.setOutputCol("vector")
.setVectorSize(64)
.setWindowSize(4)
.setNumPartitions(50)
.setMinCount(5)
.setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```
Before:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-Infinity,-Infin...|
| Talent|[-Infinity,Infini...|
| Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
| Priests|[-1.3570403355926...|
| Religion:|[-6.7049072282803...|
| Bu|[5.05591774315586...|
| Totoro:|[-1.0539840178632...|
| Trouble,|[-3.5363592836003...|
| Hatter|[4.90413981352826...|
| '79|[7.50436471285412...|
| Vile|[-2.9147142985312...|
| 9/11|[-Infinity,Infini...|
| Santino|[1.30005911270850...|
| Motives|[-1.2538958306253...|
| '13|[-4.5040152427657...|
| Fierce|[Infinity,Infinit...|
| Stover|[-2.6326895394029...|
| 'It|[1.66574533864436...|
| Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```
After:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-0.0454501919448...|
| Talent|[-0.2657704949378...|
| Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
| Priests|[-0.0047509293071...|
| Religion:|[-0.0411605164408...|
| Bu|[0.11837736517190...|
| Totoro:|[0.05258282646536...|
| Trouble,|[0.09482011198997...|
| Hatter|[0.06040831282734...|
| '79|[0.04783720895648...|
| Vile|[-0.0017210749210...|
| 9/11|[-0.0713915303349...|
| Santino|[-0.0412711687386...|
| Motives|[-0.0492418706417...|
| '13|[-0.0073119504377...|
| Fierce|[-0.0565455369651...|
| Stover|[0.06938160210847...|
| 'It|[0.01117012929171...|
| Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```
Closes#26722 from viirya/SPARK-24666-2.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Removed unnecessary persist.
### Why are the changes needed?
Persist in `PythonMLLibAPI.scala` is unnecessary because later in `run()` of `gmmAlg` is caching the data.
710ddab39e/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (L167-L171)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26758 from amanomer/improperPersist.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.
### Why are the changes needed?
Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.
The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.
While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.
### Does this PR introduce any user-facing change?
Should not change behavior.
### How was this patch tested?
Existing tests.
Closes#26748 from srowen/SPARK-29392.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, `predictionCol` in `ml.classification` & `ml.clustering` add `NominalAttribute`
2, `rawPredictionCol` in `ml.classification` add `AttributeGroup` containing vectorsize=`numClasses`
3, `probabilityCol` in `ml.classification` & `ml.clustering` add `AttributeGroup` containing vectorsize=`numClasses`/`k`
4, `leafCol` in GBT/RF add `AttributeGroup` containing vectorsize=`numTrees`
5, `leafCol` in DecisionTree add `NominalAttribute`
6, `outputCol` in models in `ml.feature` add `AttributeGroup` containing vectorsize
7, `outputCol` in `UnaryTransformer`s in `ml.feature` add `AttributeGroup` containing vectorsize
### Why are the changes needed?
Appened metadata can be used in downstream ops, like `Classifier.getNumClasses`
There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in `transform`/`transformSchema` method.
However there are also many impls return no metadata in transformation, even some metadata like `vector.size`/`numAttrs`/`attrs` can be ealily inferred.
### Does this PR introduce any user-facing change?
Yes, add some metadatas in transformed dataset.
### How was this patch tested?
existing testsuites and added testsuites
Closes#26547 from zhengruifeng/add_output_vecSize.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When PCA was first impled in [SPARK-5521](https://issues.apache.org/jira/browse/SPARK-5521), at that time Matrix.multiply(BLAS.gemv internally) did not support sparse vector. So worked around it by applying a sparse matrix multiplication.
Since [SPARK-7681](https://issues.apache.org/jira/browse/SPARK-7681), BLAS.gemv supported sparse vector. So we can directly use Matrix.multiply now.
### Why are the changes needed?
for simplity
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#26745 from zhengruifeng/pca_mul.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MNB/CNB/BNB use empty sigma matrix instead of null
### Why are the changes needed?
1,Using empty sigma matrix will simplify the impl
2,I am reviewing FM impl these days, FMModels have optional bias and linear part. It seems more reasonable to set optional part an empty vector/matrix or zero value than `null`
### Does this PR introduce any user-facing change?
yes, sigma from `null` to empty matrix
### How was this patch tested?
updated testsuites
Closes#26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Summarizer support more metrics: sum, std
### Why are the changes needed?
Those metrics are widely used, it will be convenient to directly obtain them other than a conversion.
in `NaiveBayes`: we want the sum of vectors, mean & weightSum need to computed then multiplied
in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then sqrt it to get std
### Does this PR introduce any user-facing change?
yes, new metrics are exposed to end users
### How was this patch tested?
added testsuites
Closes#26596 from zhengruifeng/summarizer_add_metrics.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MulticlassClassificationEvaluator support hammingLoss
### Why are the changes needed?
1, it is an easy to compute hammingLoss based on confusion matrix
2, scikit-learn supports it
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
added testsuites
Closes#26597 from zhengruifeng/multi_class_hamming_loss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`
### Why are the changes needed?
1, it is a better choice for text classification: it is said in [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2, CNB is highly similar to existing MNB, only a small part of existing MNB need to be changed, so it is a easy win to support CNB.
### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported
### How was this patch tested?
added testsuites
Closes#26575 from zhengruifeng/cnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
A follow-up to rm useless test in VectorUDTSuite
### Why are the changes needed?
rm useless test, which is already covered.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes#26620 from yaooqinn/SPARK-29961-f.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add typeof function for Spark to get the underlying type of value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int
-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)
-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>
-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)
```
##### presto
```sql
presto> select typeof(array[1]);
_col0
----------------
array(integer)
(1 row)
```
##### PostgreSQL
```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
pg_typeof
-----------
numeric
numeric
numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597
### Why are the changes needed?
a function which is better we have to help us debug, test, develop ...
### Does this PR introduce any user-facing change?
add a new function
### How was this patch tested?
add ut and example
Closes#26599 from yaooqinn/SPARK-29961.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use JUnit assertions in tests uniformly, not JVM assert() statements.
### Why are the changes needed?
assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
### Does this PR introduce any user-facing change?
No. The assertion logic should be identical.
### How was this patch tested?
Existing tests.
Closes#26581 from srowen/assertToJUnit.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
```LSHModel.approxNearestNeighbors``` sorts the full dataset on the hashDistance in order to find a threshold. This PR uses approxQuantile instead.
### Why are the changes needed?
To improve performance.
### Does this PR introduce any user-facing change?
Yes.
Changed ```LSH``` to make it extend ```HasRelativeError```
```LSH``` and ```LSHModel``` have new APIs ```setRelativeError/getRelativeError```
### How was this patch tested?
Existing tests. Also added a couple doc test in python to test newly added ```getRelativeError```
Closes#26415 from huaxingao/spark-18409.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
support `modelType` `gaussian`
### Why are the changes needed?
current modelTypes do not support continuous data
### Does this PR introduce any user-facing change?
yes, add a `modelType` option
### How was this patch tested?
existing testsuites and added ones
Closes#26413 from zhengruifeng/gnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover
### Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.
### Does this PR introduce any user-facing change?
```StopWordsRemover.setInputCols```
```StopWordsRemover.setOutputCols```
### How was this patch tested?
Unit tests
Closes#26480 from huaxingao/spark-29808.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust RDD to persist.
### Why are the changes needed?
To handle the improper persist strategy.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26483 from amanomer/SPARK-29823.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust improper unpersist timing on RDD.
### Why are the changes needed?
Improper unpersist timing will result in memory waste
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26469 from Icysandwich/SPARK-29844.
Authored-by: DongWang <cqwd123@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1,ML models should extend toString method to expose basic information.
Current some algs (GBT/RF/LoR) had done this, while others not yet.
2,add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`
### Why are the changes needed?
ML models should extend toString method to expose basic information.
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
existing testsuites
Closes#26439 from zhengruifeng/models_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1,unpersist intermediate rdd `wordCounts`
2,if the `dataset` is already persisted, we do not need to persist rdd `input`
3,if both `minDF`&`maxDF` are gteq or lt than 1, we can compare & check them af first.
### Why are the changes needed?
we should unpersit unused rdd ASAP
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#26398 from zhengruifeng/CountVectorizer_unpersist_wordCounts.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
use `ml.Summarizer` instead of `mllib.MultivariateOnlineSummarizer`
### Why are the changes needed?
1, I found that using `ml.Summarizer` is faster than current impl;
2, `mllib.MultivariateOnlineSummarizer` maintain all arrays, while `ml.Summarizer` only maintain necessary arrays
3, using `ml.Summarizer` will avoid vector conversions to `mlllib.Vector`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#26393 from zhengruifeng/maxabs_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR implements ```validateInput``` in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
### Why are the changes needed?
```UnaryTransformer``` has abstract method ```validateInputType``` and call it in ```transformSchema```, but this method is not implemented in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#26388 from huaxingao/spark-29746.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1, change the scope of `ml.SummarizerBuffer` and add a method `createSummarizerBuffer` for it, so it can be used as an aggregator like `MultivariateOnlineSummarizer`;
2, In LoR/AFT/LiR/SVC, use Summarizer instead of MultivariateOnlineSummarizer
### Why are the changes needed?
The computation of summary before learning iterations is a bottleneck in high-dimension cases, since `MultivariateOnlineSummarizer` compute much more than needed.
In the [ticket](https://issues.apache.org/jira/browse/SPARK-29754) is an example, with `--driver-memory=4G` LoR will always fail on KDDA dataset. If we swith to `ml.Summarizer`, then `--driver-memory=3G` is enough to train a model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites & manual test in REPL
Closes#26396 from zhengruifeng/using_SummarizerBuffer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
expose expert param `aggregationDepth` in algs: GMM/GLR
### Why are the changes needed?
SVC/LoR/LiR/AFT had exposed expert param aggregationDepth to end users. It should be nice to expose it in similar algs.
### Does this PR introduce any user-facing change?
yes, expose new param
### How was this patch tested?
added pytext tests
Closes#26322 from zhengruifeng/agg_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.
### Why are the changes needed?
Wrong order of assert parameters confuses when the assert fails and the parameters have special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes#26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
persist the input if needed
### Why are the changes needed?
training with non-cached dataset will hurt performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#26344 from zhengruifeng/linear_svc_cache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add shared param `relativeError`
2, `Imputer`/`RobusterScaler`/`QuantileDiscretizer` extend `HasRelativeError`
### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` had already added this param, while other algs not yet.
### Does this PR introduce any user-facing change?
yes, new param is added in `Imputer`/`RobusterScaler`
### How was this patch tested?
existing testsutes
Closes#26305 from zhengruifeng/add_relative_err.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
### What changes were proposed in this pull request?
add single-column input/ouput support in OneHotEncoder
### Why are the changes needed?
Currently, OneHotEncoder only has multi columns support. It makes sense to support single column as well.
### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```
### How was this patch tested?
Unit test
Closes#26265 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
add single-column input/output support in Imputer
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```
### How was this patch tested?
add unit tests
Closes#26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes#26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add weight support for GBTs by sampling data before passing it to trees and then passing weights to trees
in summary:
1, add setters of `minWeightFractionPerNode` & `weightCol`
2, update input types in private methods from `RDD[LabeledPoint]` to `RDD[Instance]`:
`DecisionTreeRegressor.train`, `GradientBoostedTrees.run`, `GradientBoostedTrees.runWithValidation`, `GradientBoostedTrees.computeInitialPredictionAndError`, `GradientBoostedTrees.computeError`,
`GradientBoostedTrees.evaluateEachIteration`, `GradientBoostedTrees.boost`, `GradientBoostedTrees.updatePredictionError`
3, add new private method `GradientBoostedTrees.computeError(data, predError)` to compute average error, since original `predError.values.mean()` do not take weights into account.
4, add new tests
### Why are the changes needed?
GBTs should support sample weights like other algs
### Does this PR introduce any user-facing change?
yes, new setters are added
### How was this patch tested?
existing & added testsuites
Closes#25926 from zhengruifeng/gbt_add_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` only contains the default parameter value. Need to update the parameter maps for these trees.
Same issues in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```
### Why are the changes needed?
User wants to access each individual tree and build the trees back up for the random forest estimator. This doesn't work because trees don't have the correct parameter values
### Does this PR introduce any user-facing change?
Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values.
### How was this patch tested?
Add tests
Closes#26154 from huaxingao/spark-29232.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss
### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
### Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param `eps`
### How was this patch tested?
added testsuites & local tests refering to sklearn
Closes#26135 from zhengruifeng/logloss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Binarizer support multi-column by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`
### Why are the changes needed?
similar algs in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`
### Does this PR introduce any user-facing change?
yes, add setter/getter of `thresholds`/`inputCols`/`outputCols`
### How was this patch tested?
added suites
Closes#26064 from zhengruifeng/binarizer_multicols.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
get the first row lazily, and reuse it for each vector column.
### Why are the changes needed?
avoid unnecssary `first` jobs
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites & local tests in repl
Closes#26052 from zhengruifeng/rformula_lazy_row.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
(x: Unit,xs: Unit*)Array[Unit] <and>
(x: Double,xs: Double*)Array[Double] <and>
(x: Float,xs: Float*)Array[Float] <and>
(x: Long,xs: Long*)Array[Long] <and>
(x: Int,xs: Int*)Array[Int] <and>
(x: Char,xs: Char*)Array[Char] <and>
(x: Short,xs: Short*)Array[Short] <and>
(x: Byte,xs: Byte*)Array[Byte] <and>
(x: Boolean,xs: Boolean*)Array[Boolean]
cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
### Why are the changes needed?
To better cross-build for 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26062 from srowen/SPARK-29401.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.
### Why are the changes needed?
To keep parity between python and scala.
### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```
### How was this patch tested?
add doctest
Closes#25947 from huaxingao/spark-29269.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
- Removal of `private[ml]` modifier from `Regressor`.
- Marking `Regressor` as `DeveloperApi`.
### Why are the changes needed?
Consistency with the rest of ML API as described in [the corresponding JIRA ticket](https://issues.apache.org/jira/browse/SPARK-29363).
### Does this PR introduce any user-facing change?
Yes, as described above.
### How was this patch tested?
Existing tests.
Closes#26033 from zero323/SPARK-29363.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`
### Why are the changes needed?
According to the Jenkins logs, there are 118 warnings about this.
```
grep "is deprecated" ~/consoleText | grep scalatest | wc -l
118
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
After Jenkins passes, we need to check the Jenkins log.
Closes#25982 from dongjoon-hyun/SPARK-29307.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 emits a deprecation warning for procedure-like declarations:
```
def foo() {
...
```
This is equivalent to the following, so should be changed to avoid a warning:
```
def foo(): Unit = {
...
```
### Why are the changes needed?
It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.
Unfortunately, that makes this quite a big change.
### Does this PR introduce any user-facing change?
No behavior change at all.
### How was this patch tested?
Existing tests.
Closes#25968 from srowen/SPARK-29291.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR regenerate the benchmark results in `core` and `mllib` module in order to compare JDK8/JDK11 result.
### Why are the changes needed?
According to the result, For `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`, JDK11 is slightly faster. In general, there is no regression in JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. Manually run the benchmark.
Closes#25969 from dongjoon-hyun/SPARK-29297.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>