## What changes were proposed in this pull request?
* changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
* adds a test that was failing before this change, but succeeds with these changes.
The problem in the previous implementation was that it only increments `i`, that is enumerating the columns of a row in the SparseMatrix, when the row-index of the vector matches the column-index of the SparseMatrix. In cases where a particular row of the SparseMatrix has non-zero values at column-indices lower than corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i` and the remaining column-indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.
## How was this patch tested?
I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
## ___
As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.
Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
Closes#15296 from bwahlgreen/bugfix-spark-17721.
## What changes were proposed in this pull request?
Several performance improvement for ```ChiSqSelector```:
1, Keep ```selectedFeatures``` ordered ascendent.
```ChiSqSelectorModel.transform``` need ```selectedFeatures``` ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
2, When training ```fpr``` type ```ChiSqSelectorModel```, it's not necessary to sort the ChiSq test result by statistic.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15277 from yanboliang/spark-17704.
## What changes were proposed in this pull request?
#14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue at ```VectorIndexerSuite```. If we create the DataFrame by ```Seq(...).toDF()```, it will throw different error/exception compared with ```sc.parallelize(Seq(...)).toDF()``` for one of the test cases.
After in-depth study, I found it was caused by different behavior of local and distributed Dataset if the UDF failed at ```assert```. If the data is local Dataset, it throws ```AssertionError``` directly; If the data is distributed Dataset, it throws ```SparkException``` which is the wrapper of ```AssertionError```. I think we should enforce this test to cover both case.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15261 from yanboliang/spark-16356.
## What changes were proposed in this pull request?
This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully-consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully-consumed.
This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
## How was this patch tested?
Tested manually for now.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#15245 from JoshRosen/SPARK-17666-close-recordreader.
## What changes were proposed in this pull request?
This PR introduces more compact representation for ```UnsafeArrayData```.
```UnsafeArrayData``` needs to accept ```null``` value in each entry of an array. In the current version, it has three parts
```
[numElements] [offsets] [values]
```
`Offsets` has the number of `numElements`, and represents `null` if its value is negative. It may increase memory footprint, and introduces an indirection for accessing each of `values`.
This PR uses bitvectors to represent nullability for each element like `UnsafeRow`, and eliminates an indirection for accessing each element. The new ```UnsafeArrayData``` has four parts.
```
[numElements][null bits][values or offset&length][variable length portion]
```
In the `null bits` region, we store 1 bit per element, represents whether an element is null. Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte boundaries.
In the `values or offset&length` region, we store the content of elements. For fields that hold fixed-length primitive types, such as long, double, or int, we store the value directly in the field. For fields with non-primitive or variable-length values, we store a relative offset (w.r.t. the base address of the array) that points to the beginning of the variable-length field and length (they are combined into a long). Each is word-aligned. For `variable length portion`, each is aligned to 8-byte boundaries.
The new format can reduce memory footprint and improve performance of accessing each element. An example of memory foot comparison:
1024x1024 elements integer array
Size of ```baseObject``` for ```UnsafeArrayData```: 8 + 1024x1024 + 1024x1024 = 2M bytes
Size of ```baseObject``` for ```UnsafeArrayData```: 8 + 1024x1024/8 + 1024x1024 = 1.25M bytes
In summary, we got 1.0-2.6x performance improvements over the code before applying this PR.
Here are performance results of [benchmark programs](04d2e4b6db/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala):
**Read UnsafeArrayData**: 1.7x and 1.6x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Read UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 430 / 436 390.0 2.6 1.0X
Double 456 / 485 367.8 2.7 0.9X
With SPARK-15962
Read UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 252 / 260 666.1 1.5 1.0X
Double 281 / 292 597.7 1.7 0.9X
````
**Write UnsafeArrayData**: 1.0x and 1.1x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Write UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 203 / 273 103.4 9.7 1.0X
Double 239 / 356 87.9 11.4 0.8X
With SPARK-15962
Write UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 196 / 249 107.0 9.3 1.0X
Double 227 / 367 92.3 10.8 0.9X
````
**Get primitive array from UnsafeArrayData**: 2.6x and 1.6x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Get primitive array from UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 207 / 217 304.2 3.3 1.0X
Double 257 / 363 245.2 4.1 0.8X
With SPARK-15962
Get primitive array from UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 151 / 198 415.8 2.4 1.0X
Double 214 / 394 293.6 3.4 0.7X
````
**Create UnsafeArrayData from primitive array**: 1.7x and 2.1x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Create UnsafeArrayData from primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 340 / 385 185.1 5.4 1.0X
Double 479 / 705 131.3 7.6 0.7X
With SPARK-15962
Create UnsafeArrayData from primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 206 / 211 306.0 3.3 1.0X
Double 232 / 406 271.6 3.7 0.9X
````
1.7x and 1.4x performance improvements in [```UDTSerializationBenchmark```](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala) over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
serialize 442 / 533 0.0 441927.1 1.0X
deserialize 217 / 274 0.0 217087.6 2.0X
With SPARK-15962
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
serialize 265 / 318 0.0 265138.5 1.0X
deserialize 155 / 197 0.0 154611.4 1.7X
````
## How was this patch tested?
Added unit tests into ```UnsafeArraySuite```
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#13680 from kiszk/SPARK-15962.
## What changes were proposed in this pull request?
This was suggested in 101663f1ae (commitcomment-17114968).
This PR adds `testImplicits` to `MLlibTestSparkContext` so that some implicits such as `toDF()` can be sued across ml tests.
This PR also changes all the usages of `spark.createDataFrame( ... )` to `toDF()` where applicable in ml tests in Scala.
## How was this patch tested?
Existing tests should work.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14035 from HyukjinKwon/minor-ml-test.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15214 from yanboliang/spark-17017.
## What changes were proposed in this pull request?
Match ProbabilisticClassifer.thresholds requirements to R randomForest cutoff, requiring all > 0
## How was this patch tested?
Jenkins tests plus new test cases
Author: Sean Owen <sowen@cloudera.com>
Closes#15149 from srowen/SPARK-17057.
## What changes were proposed in this pull request?
To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15179 from srowen/SPARK-10835.
## What changes were proposed in this pull request?
update `MultilayerPerceptronClassifierWrapper.fit` paramter type:
`layers: Array[Int]`
`seed: String`
update several default params in sparkR `spark.mlp`:
`tol` --> 1e-6
`stepSize` --> 0.03
`seed` --> NULL ( when seed == NULL, the scala-side wrapper regard it as a `null` value and the seed will use the default one )
r-side `seed` only support 32bit integer.
remove `layers` default value, and move it in front of those parameters with default value.
add `layers` parameter validation check.
## How was this patch tested?
tests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15051 from WeichenXu123/update_py_mlp_default.
## What changes were proposed in this pull request?
RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
This PR:
(a) Change the RF node queue to be FILO (a stack), so that RFs tend to focus on 1 or a few trees before focusing on others.
(b) Change topNodes to pass only the trees required on that iteration.
## How was this patch tested?
Unit tests:
* Existing tests for correctness of tree learning
* Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
* This last item is hard to test via unit tests given the current APIs.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14359 from jkbradley/rfs-fewer-trees.
## What changes were proposed in this pull request?
Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
## How was this patch tested?
I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x). In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15034 from jkbradley/lda-backwards.
## What changes were proposed in this pull request?
Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
## What changes were proposed in this pull request?
Update error handling for Cholesky decomposition to provide a little more info when input is singular.
## How was this patch tested?
New test case; jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15177 from srowen/SPARK-11918.
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes#14858 from VinceShieh/spark-17219.
## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?
Add Scala ut
Author: Peng, Meng <peng.meng@intel.com>
Closes#14597 from mpjlu/fprChiSquare.
## What changes were proposed in this pull request?
The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.
## How was this patch tested?
This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:
```
object W2VTiming {
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.feature.Word2VecModel
def run(modelPath: String, scOpt: Option[SparkContext] = None) {
val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
val model = Word2VecModel.load(sc, modelPath)
val keys = model.getVectors.keys
val start = System.currentTimeMillis
for(key <- keys) {
model.findSynonyms(key, 5)
model.findSynonyms(key, 10)
model.findSynonyms(key, 25)
model.findSynonyms(key, 50)
}
val finish = System.currentTimeMillis
println("run completed in " + (finish - start) + "ms")
}
}
```
I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
Author: William Benton <willb@redhat.com>
Closes#15150 from willb/SPARK-17595.
## What changes were proposed in this pull request?
Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.
Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
## How was this patch tested?
Merged test suites and added some new unit tests.
## Design
### Switching between binomial and multinomial
We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
### coefficients/intercept model API (TODO)
This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
* We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
* There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.
Some options:
1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.
If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
**Update:** We have chosen option 2
### Threshold/thresholds (TODO)
Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
**Update:** Let's leave it for a follow up PR
## Follow up
* Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
Author: sethah <seth.hendrickson16@gmail.com>
Closes#14834 from sethah/SPARK-17163.
## What changes were proposed in this pull request?
This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
## How was this patch tested?
I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
Author: William Benton <willb@redhat.com>
Closes#15105 from willb/fix/findSynonyms.
## What changes were proposed in this pull request?
as the TODO described,
check weight vector size and if wrong throw exception.
## How was this patch tested?
existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15060 from WeichenXu123/check_input_weight_size_of_ann.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15050 from yanboliang/spark-17389.
## What changes were proposed in this pull request?
Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also https://github.com/apache/spark/pull/14948
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#14956 from srowen/SPARK-17389.2.
## What changes were proposed in this pull request?
#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved:
1, It’s not necessary to check and rename label column.
Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception.
2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```.
3, We should set correct new features column for the estimators. Take ```GLM``` as example:
```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14993 from yanboliang/spark-15509.
## What changes were proposed in this pull request?
We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14914 from lw-lin/append_to_plus_eq_v2.
## What changes were proposed in this pull request?
1, remove unnecessary `require()`, because it will make following check useless.
2, update the error msg.
## How was this patch tested?
no test
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14972 from zhengruifeng/del_unnecessary_check.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.
## How was this patch tested?
Doc change.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14967 from yanboliang/mlp-weights.
## What changes were proposed in this pull request?
If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. An example
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
...
```
We should catch these exceptions and rethrow them with better error message, to say that the exception is happened in scala udf.
This PR also does some clean up for `ScalaUDF` and add a unit test suite for it.
## How was this patch tested?
the new test suite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14850 from cloud-fan/npe.
## What changes were proposed in this pull request?
Since we have updated breeze version to 0.12, we should remove work around for bug of breeze sparse matrix in v0.11.
I checked all mllib code and found this is the only work around for breeze 0.11.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14953 from yanboliang/matrices.
## What changes were proposed in this pull request?
Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change.
- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14826 from srowen/SPARK-16832.2.
## What changes were proposed in this pull request?
Improved the code quality of spark by replacing all pattern match on boolean value by if/else block.
## How was this patch tested?
By running the tests
Author: Shivansh <shiv4nsh@gmail.com>
Closes#14873 from shiv4nsh/SPARK-17308.
## What changes were proposed in this pull request?
This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution.
## How was this patch tested?
R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14881 from junyangq/SPARK-17315.
## What changes were proposed in this pull request?
fix `MultivariantOnlineSummerizer.numNonZeros` method,
return `nnz` array, instead of `weightSum` array
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14923 from WeichenXu123/fix_MultivariantOnlineSummerizer_numNonZeros.
https://issues.apache.org/jira/browse/SPARK-15509
## What changes were proposed in this pull request?
Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
`training <- loadDF(sqlContext, ".../mnist", "libsvm")`
`model <- naiveBayes(label ~ features, training)`
This fails with:
```
16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: Output column features already exists.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
The same issue appears for the "label" column once you rename the "features" column.
```
The cause is, when using `loadDF()` to generate dataframes, sometimes it’s with default column name `“label”` and `“features”`, and these two name will conflict with default column names `setDefault(labelCol, "label")` and ` setDefault(featuresCol, "features")` of `SharedParams.scala`
## How was this patch tested?
Test on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#13584 from keypointt/SPARK-15509.
## What changes were proposed in this pull request?
Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14895 from srowen/SPARK-17331.
https://issues.apache.org/jira/browse/SPARK-17241
## What changes were proposed in this pull request?
Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression.
## How was this patch tested?
Test manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14856 from keypointt/SPARK-17241.
## What changes were proposed in this pull request?
Clean up unused variables and unused import statements, unnecessary `return` and `toArray`, and some more style improvement, when I walk through the code examples.
## How was this patch tested?
Testet manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14836 from keypointt/codeWalkThroughML.
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
## How was this patch tested?
Jenkins tests, including new caes to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes#14663 from srowen/SPARK-17001.
## What changes were proposed in this pull request?
The require condition and message doesn't match, and the condition also should be optimized.
Small change. Please kindly let me know if JIRA required.
## How was this patch tested?
No additional test required.
Author: Peng, Meng <peng.meng@intel.com>
Closes#14824 from mpjlu/smallChangeForMatrixRequire.
## What changes were proposed in this pull request?
fix comparing Vector bug in TestingUtils.
There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Peng, Meng <peng.meng@intel.com>
Closes#14785 from mpjlu/testUtils.
https://issues.apache.org/jira/browse/SPARK-16445
## What changes were proposed in this pull request?
Create Multilayer Perceptron Classifier wrapper in SparkR
## How was this patch tested?
Tested manually on local machine
Author: Xin Ren <iamshrek@126.com>
Closes#14447 from keypointt/SPARK-16445.
## What changes were proposed in this pull request?
In cases when QuantileDiscretizerSuite is called upon a numeric array with duplicated elements, we will take the unique elements generated from approxQuantiles as input for Bucketizer.
## How was this patch tested?
An unit test is added in QuantileDiscretizerSuite
QuantileDiscretizer.fit will throw an illegal exception when calling setSplits on a list of splits
with duplicated elements. Bucketizer.setSplits should only accept either a numeric vector of two
or more unique cut points, although that may produce less number of buckets than requested.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes#14747 from VinceShieh/SPARK-17086.
## What changes were proposed in this pull request?
Fix a typo
## How was this patch tested?
no tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14772 from zhengruifeng/minor_numClasses.
## What changes were proposed in this pull request?
In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output.
Author: Jagadeesan <as2@us.ibm.com>
Closes#14688 from jagadeesanas2/SPARK-17095.
## What changes were proposed in this pull request?
Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added.
Author: hqzizania <hqzizania@gmail.com>
Closes#14738 from hqzizania/SPARK-17090-minor.
## What changes were proposed in this pull request?
Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc.
## How was this patch tested?
Extended doctests
Author: Holden Karau <holden@us.ibm.com>
Closes#12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
## What changes were proposed in this pull request?
Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties.
This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place.
changes overview:
1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`)
**after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore.
2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties.
**after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it.
bonus: now we can create data source table using `SessionCatalog`, if schema is specified.
breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14155 from cloud-fan/catalog-table.
## What changes were proposed in this pull request?
Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml
Author: hqzizania <hqzizania@gmail.com>
Closes#14717 from hqzizania/SPARK-17090.
## What changes were proposed in this pull request?
In the existing code, ```MinMaxScaler``` handle ```NaN``` value indeterminately.
* If a column has identity value, that is ```max == min```, ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even the original value is ```NaN```.
* Otherwise, it will remain ```NaN``` after transformation.
I think we should unify the behavior by remaining ```NaN``` value at any condition, since we don't know how to transform a ```NaN``` value. In Python sklearn, it will throw exception when there is ```NaN``` in the dataset.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14716 from yanboliang/spark-17141.
## What changes were proposed in this pull request?
This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML.
JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159)
## How was this patch tested?
Added new test suite `MultinomialLogisticRegressionSuite`.
## Approach
### Do not use a "pivot" class in the algorithm formulation
Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilites using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai.
### Separate multinomial logistic regression and binary logistic regression
The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.
**Arguments for:**
* The multinomial case without pivot is distinctly different than the current binary case since the binary case uses a pivot class.
* The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant.
**Arguments against:**
* It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two class outcomes).
* Some portions of the code are repeated.
This is a major design point and warrants more discussion.
### Mean centering
When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well.
### Feature scaling
In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimizaton. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this has significant (4x in some cases) slow downs in each iteration. This can be avoided by transforming the input to the scaled space ala MLlib once, before iteration begins. This does add some overhead initially, but can make significant time savings in some cases.
One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion.
### Specifying and inferring the number of outcome classes
The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm.
## Performance
Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.
Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each
Notes:
* Time in units of seconds
* Metric is classification accuracy
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.746415 | 30 | 100000 | 3 | 100000 | 327.923 | true | 0 |
| mllib | 0 | true | 0.743785 | 30 | 100000 | 3 | 100000 | 390.217 | true | 0 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.973238 | 30 | 2000000 | 3 | 10000 | 385.476 | true | 0 |
| mllib | 0 | true | 0.949828 | 30 | 2000000 | 3 | 10000 | 550.403 | true | 0 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| mllib | 0 | true | 0.864358 | 30 | 2000000 | 3 | 10000 | 543.359 | true | 0.1 |
| ml | 0 | true | 0.867418 | 30 | 2000000 | 3 | 10000 | 401.955 | true | 0.1 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 1 | true | 0.807449 | 30 | 2000000 | 3 | 10000 | 334.892 | true | 0.05 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.602006 | 30 | 2000000 | 500 | 100 | 112.319 | true | 0 |
| mllib | 0 | true | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 |e | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 |
## References
Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf)
[http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
## Follow up items
* Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134)
* Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140)
* Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138)
* Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090)
* Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135)
* Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136)
* Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137)
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13796 from sethah/SPARK-7159_M.
## What changes were proposed in this pull request?
Add LDA Wrapper in SparkR with the following interfaces:
- spark.lda(data, ...)
- spark.posterior(object, newData, ...)
- spark.perplexity(object, ...)
- summary(object)
- write.ml(object)
- read.ml(path)
## How was this patch tested?
Test with SparkR unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14229 from yinxusen/SPARK-16447.
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14392 from yanboliang/spark-16446.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Fix ```LogisticRegression``` typo in error message.
## How was this patch tested?
Docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14633 from yanboliang/lr-typo.
## What changes were proposed in this pull request?
Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient
## How was this patch tested?
Spark unit tests
Author: zero323 <zero323@users.noreply.github.com>
Closes#14614 from zero323/SPARK-17027.
## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance.
BTW, we should destroy broadcast variable ```compute``` at the end of each iteration.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14621 from yanboliang/spark-17033.
## What changes were proposed in this pull request?
Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#14407 from holdenk/SPARK-16779.
## What changes were proposed in this pull request?
This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review.
The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process.
## How was this patch tested?
Modified unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14455 from yanboliang/transformSchema.
## What changes were proposed in this pull request?
Add threshoulds' length checking for Classifiers which extends ProbabilisticClassifier
## How was this patch tested?
unit tests and manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14470 from zhengruifeng/classifier_check_setThreshoulds_length.
## What changes were proposed in this pull request?
To Make sure ANN layer input training data to be persisted,
so that it can avoid overhead cost if the RDD need to be computed from lineage.
## How was this patch tested?
Existing Tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14483 from WeichenXu123/add_ann_persist_training_data.
## What changes were proposed in this pull request?
Add a length checking for threshoulds' length in method `setThreshoulds()` of classification models.
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14457 from zhengruifeng/check_setThresholds.
## What changes were proposed in this pull request?
Removed useless latex in a log messge.
## How was this patch tested?
Check generated scaladoc.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes#14380 from lins05/fix-docs-formatting.
## What changes were proposed in this pull request?
update unused broadcast in KMeans/Word2Vec,
use destroy(false) to release memory in time.
and several place destroy() update to destroy(false) so that it will be async-called,
it will better than blocking called.
and update bcNewCenters in KMeans to make it destroy in correct time.
I use a list to store all historical `bcNewCenters` generated in each loop iteration and delay them to release at the end of loop.
fix TODO in `BisectingKMeans.run` "unpersist old indices",
Implements the pattern "persist current step RDD, and unpersist previous one" in the loop iteration.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14333 from WeichenXu123/broadvar_unpersist_to_destroy.
## What changes were proposed in this pull request?
Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14332 from srowen/SPARK-16694.
## What changes were proposed in this pull request?
ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake.
See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
Why the unit tests did not complain this errors? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` firstly during ```fit``` or ```transform```. I will also add this function to all estimators/transformers who missed in this PR.
## How was this patch tested?
No new tests, should pass existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14378 from yanboliang/spark-16750.
## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.
## How was this patch tested?
Documentation update
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#13894 from krishnakalyan3/kfold-cv.
## What changes were proposed in this pull request?
Fix some mistake in ```LinearRegression``` formula.
## How was this patch tested?
Documents change, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14369 from yanboliang/LiR-formula.
## What changes were proposed in this pull request?
In `LDAOptimizer.submitMiniBatch`, do persist on `stats: RDD[(BDM[Double], List[BDV[Double]])]`
and also move the place of unpersisting `expElogbetaBc` broadcast variable,
to avoid the `expElogbetaBc` broadcast variable to be unpersisted too early,
and update previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)`
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14335 from WeichenXu123/improve_LDA.
## What changes were proposed in this pull request?
replace ANN convergence tolerance param default
from 1e-4 to 1e-6
so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.
## How was this patch tested?
Existing Test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14286 from WeichenXu123/update_ann_tol.
## What changes were proposed in this pull request?
renaming var names to make code more clear:
nnz => weightSum
weightSum => totalWeightSum
and add a new member vector `nnz` (not `nnz` in previous code, which renamed to `weightSum`) to count each dimensions non-zero value number.
using `nnz` which I added above instead of `weightSum` when calculating min/max so that it fix several numerical error in some extreme case.
## How was this patch tested?
A new testcase added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14216 from WeichenXu123/multivarOnlineSummary.
## What changes were proposed in this pull request?
Forgotten broadcasted variables were persisted into a previous #PR 14153). This PR turns those `unpersist()` into `destroy()` so that memory is freed even on the driver.
## How was this patch tested?
Unit Tests in Word2VecSuite were run locally.
This contribution is done on behalf of Criteo, according to the
terms of the Apache license 2.0.
Author: Anthony Truchet <a.truchet@criteo.com>
Closes#14268 from AnthonyTruchet/SPARK-16440.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
`\partial\x` ==> `\partial x`
`har{x_i}` ==> `hat{x_i}`
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14246 from WeichenXu123/fix_formular_err.
https://issues.apache.org/jira/browse/SPARK-16535
## What changes were proposed in this pull request?
When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).
ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
## How was this patch tested?
I've tested by re-building the project, and build succeeded.
Author: Xin Ren <iamshrek@126.com>
Closes#14189 from keypointt/SPARK-16535.
## What changes were proposed in this pull request?
fininsh => finish
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14238 from WeichenXu123/fix_fininsh_typo.
## What changes were proposed in this pull request?
These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.
## How was this patch tested?
Jenkins build of docs
Author: Sean Owen <sowen@cloudera.com>
Closes#14221 from srowen/SPARK-3359.3.
## What changes were proposed in this pull request?
Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition.
## How was this patch tested?
Added a unit test.
Author: z001qdp <Nicholas.Eggert@target.com>
Closes#14140 from neggert/SPARK-16426-isotonic-nan.
## What changes were proposed in this pull request?
Add warning_for the following case when LBFGS training not actually convergence:
1) LogisticRegression
2) AFTSurvivalRegression
3) LBFGS algorithm wrapper in mllib package
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
## What changes were proposed in this pull request?
Fixing issues found during 2.0 API checks:
* GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed
* sqlDataTypes: name does not follow conventions. Do we need to expose it?
* Evaluator: inconsistent doc between evaluate and isLargerBetter
* MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little
* GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name
## How was this patch tested?
Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14187 from jkbradley/final-api-check-2.0.
## What changes were proposed in this pull request?
General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.
Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier
DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
## How was this patch tested?
N/A
Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14147 from jkbradley/experimental-audit.
## What changes were proposed in this pull request?
We have a use case of multiplying very big sparse matrices. we have about 1000x1000 distributed block matrices multiplication and the simulate multiply goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it slightly with classical hashmap and now run in about 30 seconds O(n^2).
## How was this patch tested?
We have added a performance test and verified the reduced time.
Author: oraviv <oraviv@paypal.com>
Closes#14068 from uzadude/master.
## What changes were proposed in this pull request?
Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14153 from srowen/SPARK-16440.
## What changes were proposed in this pull request?
In `ml.regression.LinearRegression`, it use breeze `LBFGS` and `OWLQN` optimizer to do data training, but do not check whether breeze's optimizer returned result actually reached convergence.
The `LBFGS` and `OWLQN` optimizer in breeze finish iteration may result the following situations:
1) reach max iteration number
2) function reach value convergence
3) objective function stop improving
4) gradient reach convergence
5) search failed(due to some internal numerical error)
I add warning printing code so that
if the iteration result is (1) or (3) or (5) in above, it will print a warning with respective reason string.
## How was this patch tested?
Manual.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14122 from WeichenXu123/add_lr_not_convergence_warn.
## What changes were proposed in this pull request?
In `train` method of `ml.regression.LinearRegression` when handling situation `std(label) == 0`
the code replace `std(label)` with `mean(label)` but the relative comment is inconsistent, I update it.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14121 from WeichenXu123/update_lr_comment.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
The following Java code because of type erasing:
```Java
JavaRDD<Vector> rows = jsc.parallelize(...);
RowMatrix mat = new RowMatrix(rows.rdd());
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
```
We should use retag to restore the type to prevent the following exception:
```Java
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
```
## How was this patch tested?
Java unit test
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14051 from yinxusen/SPARK-16372.
## What changes were proposed in this pull request?
"test big model load / save" in Word2VecSuite, lately resulted into OOM.
Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test.
## How was this patch tested?
It was tested running the following unit test:
org.apache.spark.mllib.feature.Word2VecSuite
Author: tmnd1991 <antonio.murgia2@studio.unibo.it>
Closes#13509 from tmnd1991/SPARK-15740.
## What changes were proposed in this pull request?
The current tests assumes that `impurity.calculate()` returns the variance correctly. It should be better to make the tests independent of this assumption. In other words verify that the variance computed equals the variance computed manually on a small tree.
## How was this patch tested?
The patch is a test....
Author: MechCoder <mks542@nyu.edu>
Closes#13981 from MechCoder/dt_variance.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16249
Change visibility of Object ml.clustering.LDA to public for loading, thus users can invoke LDA.load("path").
## How was this patch tested?
existing ut and manually test for load ( saved with current code)
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13941 from hhbyyh/ldapublic.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14608
PipelineStage.transformSchema currently has minimal documentation. It should have more to explain it can:
check schema
check parameter interactions
## How was this patch tested?
unit test
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#12384 from hhbyyh/transformSchemaDoc.
## What changes were proposed in this pull request?
model loading backward compatibility for ml NaiveBayes
## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.
Author: zlpmichelle <zlpmichelle@gmail.com>
Closes#13940 from zlpmichelle/naivebayes.
## What changes were proposed in this pull request?
What changes were proposed in this pull request?
Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees
## How was this patch tested?
the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail.
**PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0
Author: Mahmoud Rawas <mhmoudr@gmail.com>
Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au>
Closes#13624 from mhmoudr/SPARK-15858.master.
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature.PCA.
## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13937 from yanboliang/spark-16245.
## What changes were proposed in this pull request?
This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame.
## How was this patch tested?
Doctest in python.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13935 from yanboliang/spark-16242.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16187
This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually.
## How was this patch tested?
java and scala ut
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13888 from hhbyyh/matComp.
## What changes were proposed in this pull request?
Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66.
## How was this patch tested?
Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results.
line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1
crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length.
To fix this just make trueWeights be the same length as x.
I have recompiled the project with the change and it is working now:
[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test
And it generates the data successfully now in the specified folder.
Author: José Antonio <joseanmunoz@gmail.com>
Closes#13895 from j4munoz/patch-2.
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature,
## How was this patch tested?
existing ut and manual test for loading 1.6 models.
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13844 from hhbyyh/featureComp.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16177
model loading backward compatibility for ml.regression
## How was this patch tested?
existing ut and manual test for loading 1.6 models.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13879 from hhbyyh/regreComp.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16130
model loading backward compatibility for ml.classfication.LogisticRegression
## How was this patch tested?
existing ut and manual test for loading old models.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13841 from hhbyyh/lrcomp.
## What changes were proposed in this pull request?
Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.
## How was this patch tested?
Manually checked generated APIs.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13859 from mengxr/SPARK-16154.
## What changes were proposed in this pull request?
We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
~~~scala
/** group setParam */
Since("1.6.0")
deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
def setLabelCol(value: String): this.type = set(labelCol, value)
~~~
This unfortunately hit a genjavadoc bug and broken doc generation. This is the generated Java code:
~~~java
/** group setParam */
public org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value) { throw new RuntimeException(); }
*
* deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
*/
public org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value) { throw new RuntimeException(); }
~~~
Switching to multiline is a workaround.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13855 from mengxr/SPARK-16153.
## What changes were proposed in this pull request?
`DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use it. So this PR changes the annotation to `DeveloperApi`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13828 from mengxr/default-readable-should-be-developer-api.
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.
## How was this patch tested?
Existing unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13840 from MLnick/SPARK-16127-ml-linalg-since.
## What changes were proposed in this pull request?
Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.
## How was this patch tested?
Built docs locally & PySpark SQL tests
Author: Holden Karau <holden@us.ibm.com>
Closes#12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
Also fix a test case issue in `BroadcastJoinSuite`.
BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13380 from gatorsmile/sqlContextML.
## What changes were proposed in this pull request?
Deprecate `labelCol`, which is not used by ChiSqSelectorModel.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
## What changes were proposed in this pull request?
We forgot the getter of `dropLast` in `OneHotEncoder`
## How was this patch tested?
unit test
Author: Xiangrui Meng <meng@databricks.com>
Closes#13821 from mengxr/SPARK-16118.
## What changes were proposed in this pull request?
LibSVMFileFormat implements data source for LIBSVM format. However, users do not really need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal it to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124.
## How was this patch tested?
Manually checked the generated API doc and tested loading LIBSVM data.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13819 from mengxr/SPARK-16117.
## What changes were proposed in this pull request?
The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13813 from mengxr/checkpoint-non-expert.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13801 from mengxr/SPARK-15177.1.
This PR adds missing `Since` annotations to `ml.feature` package.
Closes#8505.
## How was this patch tested?
Existing tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13641 from MLnick/add-since-annotations.
## What changes were proposed in this pull request?
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
## How was this patch tested?
Unit tests in Scala and Java.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13789 from mengxr/SPARK-16074.
## What changes were proposed in this pull request?
This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.
## How was this patch tested?
doctest in Python
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#13731 from mengxr/SPARK-15946.
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)
## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x (for multiclass logistic regression, this number will go up) larger than it should be (in MLlib, for instance, it is 3x smaller).
This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization.
## How was this patch tested?
I tested this locally and verified the serialization reduction.
![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)
Additionally, I ran some tests of a 4 node cluster (4x48 cores, 4x128 GB RAM). Data set size of 2M rows and 10k features showed >2x iteration speedup.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13729 from sethah/lr_improvement.
## What changes were proposed in this pull request?
SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.
**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```
**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```
## How was this patch tested?
Pass the Jenkins tests (including the above case)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13643 from dongjoon-hyun/SPARK-15922.
## What changes were proposed in this pull request?
Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.
However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13698 from liancheng/remove-prepare-read.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
## How was this patch tested?
Renamed test cases as well.
Author: Reynold Xin <rxin@databricks.com>
Closes#13696 from rxin/parquet-rename.
The PR changes outdated scaladocs for Gini and Entropy classes. Since PR #886 Spark supports multiclass classification, but the docs tell only about binary classification.
Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
Closes#11252 from wjur/wjur/docs_multiclass.
## What changes were proposed in this pull request?
This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens.
This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.
## How was this patch tested?
Unit tests in Scala and Java.
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#13662 from mengxr/SPARK-15945.
## What changes were proposed in this pull request?
Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13219 from viirya/pyspark-pickler-ml.
## What changes were proposed in this pull request?
Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:
```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```
Please see [AFTSurvivalRegression.scala#L573-L575](6ecedf39b4/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L573-L575)) as well.
Just to be clear, the python example `aft_survival_regression.py` seems using 5 rows. So, if there exist partitions more than 5, it throws the exception above since it contains empty partitions which results in an incorrectly merged `AFTAggregator`.
Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with CPUs more than 5 is being failed because it creates tasks with some empty partitions with defualt configurations (AFAIK, it sets the parallelism level to the number of CPU cores).
## How was this patch tested?
An unit test in `AFTSurvivalRegressionSuite.scala` and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#13619 from HyukjinKwon/SPARK-15892.
## What changes were proposed in this pull request?
Currently, we always split the files when it's bigger than maxSplitBytes, but Hadoop LineRecordReader does not respect the splits for compressed files correctly, we should have a API for FileFormat to check whether the file could be splitted or not.
This PR is based on #13442, closes#13442
## How was this patch tested?
add regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13531 from davies/fix_split.
## What changes were proposed in this pull request?
In scala, immutable.List.length is an expensive operation so we should
avoid using Seq.length == 0 or Seq.lenth > 0, and use Seq.isEmpty and Seq.nonEmpty instead.
## How was this patch tested?
existing tests
Author: wangyang <wangyang@haizhi.com>
Closes#13601 from yangw1234/isEmpty.
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.
## How was this patch tested?
run pyspark-ml tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15793
Word2vec in ML package should have maxSentenceLength method for feature parity.
## How was this patch tested?
Tested with Spark unit test.
Author: yinxusen <yinxusen@gmail.com>
Closes#13536 from yinxusen/SPARK-15793.
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM.
When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg.
We should output a warning message and clarify in document for this condition.
## How was this patch tested?
Document change, no unit test.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12731 from yanboliang/spark-13590.
## What changes were proposed in this pull request?
Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.
## How was this patch tested?
Wrote example making use of the now-public APIs. Compiled and ran locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13461 from jkbradley/defaultparamswritable.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.
This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13491 from JoshRosen/foldleft-to-flatmap.
## What changes were proposed in this pull request?
1, remove comments `:: Experimental ::` for non-experimental API
2, add comments `:: Experimental ::` for experimental API
3, add comments `:: DeveloperApi ::` for developerApi API
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13514 from zhengruifeng/del_experimental.
## What changes were proposed in this pull request?
1, del precision,recall in `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`
## How was this patch tested?
local build
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes#13390 from zhengruifeng/clarify_f1.
## What changes were proposed in this pull request?
Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.
1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#13269 from cloud-fan/clean-encoder.
## What changes were proposed in this pull request?
andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
This PR disables the test. I will leave the JIRA open for a proper fix
## How was this patch tested?
No new features.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13478 from mengxr/SPARK-15740.
## What changes were proposed in this pull request?
ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
## How was this patch tested?
existing ut
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13411 from hhbyyh/schemaCheck.
Clean up style for param setter methods in ALS to match standard style and the other setter in class (this is an artefact of one of my previous PRs that wasn't cleaned up).
## How was this patch tested?
Existing tests - no functionality change.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13480 from MLnick/als-param-minor-cleanup.
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.
## How was this patch tested?
Exist tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13410 from yanboliang/spark-15587.
## What changes were proposed in this pull request?
if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when removing CheckpointFile in MLlib.
So we should always get the FileSystem from Path to avoid wrong FS problem.
## How was this patch tested?
N/A
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13408 from lianhuiwang/SPARK-15664.
## What changes were proposed in this pull request?
This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13365 from dongjoon-hyun/SPARK-15618.
## What changes were proposed in this pull request?
This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#13377 from srowen/BuildWarnings.
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...` -> `require(k < sources.first().size`
BTW, remove unused import in `ml.ElementwiseProduct`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13356 from zhengruifeng/fix_pca.
## What changes were proposed in this pull request?
We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#13198 from dbtsai/minor.
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference.
## How was this patch tested?
Document update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13262 from yanboliang/spark-15484.