## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15826 from yanboliang/spark-14077-2.
## What changes were proposed in this pull request?
Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model.
## How was this patch tested?
This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups.
## Performance Tests
**Setup**
3 node bare-metal cluster
120 cores total
384 gb RAM total
**Results**
NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN.
| | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup |
|----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------|
| 0 | 1e+07 | 100 | 500 | 0.5 | 0 | 90 | 18 | 80 |
| 1 | 1e+08 | 100 | 50 | 0.5 | 0 | 90 | 19 | 78 |
| 2 | 1e+08 | 100 | 50 | 0.05 | 1 | 72 | 19 | 73 |
| 3 | 1e+06 | 100 | 5000 | 0.5 | 0 | 93 | 53 | 43 |
| 4 | 1e+07 | 100 | 5000 | 0.5 | 0 | 900 | 390 | 56 |
| 5 | 1e+08 | 100 | 500 | 0.5 | 0 | 840 | 174 | 79 |
| 6 | 1e+08 | 100 | 200 | 0.5 | 0 | 360 | 72 | 80 |
| 7 | 1e+08 | 1000 | 5 | 0.5 | 0 | 9 | 3 | 66 |
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output original label rather than the indexed label. This issue is very similar with [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15842 from yanboliang/spark-18401.
## What changes were proposed in this pull request?
ALS.run fail with better message if ratings is empty rdd
ALS.train and ALS.trainImplicit are also affected
## How was this patch tested?
added new tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#15809 from techaddict/SPARK-18268.
## What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15746 from felixcheung/rgbt.
## What changes were proposed in this pull request?
* Made SingularMatrixException private ml
* WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0
## How was this patch tested?
existing tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15779 from jkbradley/wls-cleanups.
## What changes were proposed in this pull request?
Close `FileStreams`, `ZipFiles` etc to release the resources after using. Not closing the resources will cause IO Exception to be raised while deleting temp files.
## How was this patch tested?
Existing tests
Author: U-FAREAST\tl <tl@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Tao LI <tl@microsoft.com>
Closes#15618 from HyukjinKwon/SPARK-14914-1.
## What changes were proposed in this pull request?
Motivation:
`org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`
Solution:
- fix for Pipeline UID
- introduced new tests for `org.apache.spark.ml.Pipeline.copy`
- minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`
## How was this patch tested?
Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`
Author: Wojciech Szymanski <wk.szymanski@gmail.com>
Closes#15759 from wojtek-szymanski/SPARK-18210.
## What changes were proposed in this pull request?
Only some of the models which contain a training summary currently set the summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent.
## How was this patch tested?
Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15773 from sethah/SPARK-18276.
## What changes were proposed in this pull request?
Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat`
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15610 from srowen/SPARK-18076.
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?
Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15647 from jkbradley/chisqselector-follow-ups.
## What changes were proposed in this pull request?
1, move cast to `Predictor`
2, and then, remove unnecessary cast
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15414 from zhengruifeng/move_cast.
## What changes were proposed in this pull request?
This patch introduces an internal commit protocol API that is used by the batch data source to do write commits. It currently has only one implementation that uses Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
## How was this patch tested?
Should be covered by existing write tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>
Closes#15707 from rxin/SPARK-18024-2.
## What changes were proposed in this pull request?
Random Forest Regression and Classification for R
Clean-up/reordering generics.R
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15607 from felixcheung/rrandomforest.
## What changes were proposed in this pull request?
Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15450 from srowen/SPARK-3261.
## What changes were proposed in this pull request?
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).
Detailed changes are as follows:
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark Integration for the scala classes and methods.
## How was this patch tested?
Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally.
Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit).
## References
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>
Closes#15148 from Yunni/SPARK-5992-yunn-lsh.
## What changes were proposed in this pull request?
Add instrumentation to GMM
## How was this patch tested?
Test in spark-shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15636 from zhengruifeng/gmm_instr.
## What changes were proposed in this pull request?
This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.
'''Before:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
.setHandleNaN("keep")
## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15428 from VinceShieh/spark-17219_followup.
## What changes were proposed in this pull request?
As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
## How was this patch tested?
New unit tests are added.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15365 from wangmiao1981/glm.
## What changes were proposed in this pull request?
Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15555 from yanboliang/clustering-summary.
## What changes were proposed in this pull request?
This is follow-up work of #15394.
Reorg some variables of ```WeightedLeastSquares``` and fix one minor issue of ```WeightedLeastSquaresSuite```.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15621 from yanboliang/spark-17748.
## What changes were proposed in this pull request?
update SparkR MLP, add initalWeights parameter.
## How was this patch tested?
test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15552 from WeichenXu123/mlp_r_add_initialWeight_param.
## What changes were proposed in this pull request?
Add instrumentation for logging in ML GBT, part of umbrella ticket [SPARK-14567](https://issues.apache.org/jira/browse/SPARK-14567)
## How was this patch tested?
Tested locally:
````
16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: training: numPartitions=1 storageLevel=StorageLevel(1 replicas)
16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"maxIter":1}
16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numFeatures":2}
16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numClasses":0}
...
16/10/20 15:54:21 INFO Instrumentation: GBTRegressor-gbtr_065fad465377-1922077832-22: training finished
````
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15574 from sethah/gbt_instr.
## What changes were proposed in this pull request?
#15394 introduced build error for Scala 2.10, this PR fix it.
## How was this patch tested?
Existing test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15625 from yanboliang/spark-17748-scala.
## What changes were proposed in this pull request?
As commented by jkbradley in https://github.com/apache/spark/pull/12394, `model.setSummary(summary)` is superfluous
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15619 from zhengruifeng/del_superfluous.
## What changes were proposed in this pull request?
1. Make a pluggable solver interface for `WeightedLeastSquares`
2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares`
3. Add method `BLAS.dspmv` used by QN solver
4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails.
## How was this patch tested?
Unit tests - see below.
## Design choices
**Pluggable Normal Solver**
Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait:
````scala
private[ml] sealed trait NormalEquationSolver {
def solve(
bBar: Double,
bbBar: Double,
abBar: DenseVector,
aaBar: DenseVector,
aBar: DenseVector): NormalEquationSolution
}
````
We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface.
**Always train in the standardized space**
The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels.
**Use L-BFGS locally to solve normal equations for singular matrix**
When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well.
## Test cases
I found it helpful to enumerate some of the test cases and hopefully it makes review easier.
**WeightedLeastSquares**
1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero.
4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package.
5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet.
6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean.
7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R.
**LinearRegression**
1. For all existing L1 test cases, test the "normal" solver too.
2. Check that using the normal solver now handles singular matrices.
3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA.
**BLAS**
1. Test new method `dspmv`.
## Performance Testing
This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement.
We should decide if this PR needs to be blocked before performance testing is conducted.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15394 from sethah/SPARK-17748.
## What changes were proposed in this pull request?
Add missing tests for `truePositiveRate` and `weightedTruePositiveRate` in `MulticlassMetricsSuite`
## How was this patch tested?
added testing
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15585 from zhengruifeng/mc_missing_test.
## What changes were proposed in this pull request?
A call to the method `SQLTransformer.transform` previously would create a temporary table and never delete it. This change adds a call to `dropTempView()` that deletes this temporary table before returning the result so that the table will not remain in spark's table catalog. Because `tableName` is randomized and not exposed, there should be no expected use of this table outside of the `transform` method.
## How was this patch tested?
A single new assertion was added to the existing test of the `SQLTransformer.transform` method that all temporary tables are removed. Without the corresponding code change, this new assertion fails. I am not aware of any circumstances in which removing this temporary view would be bad for performance or correctness in other ways, but some expertise here would be helpful.
Author: Drew Robb <drewrobb@gmail.com>
Closes#15526 from drewrobb/SPARK-17986.
## What changes were proposed in this pull request?
This patch adds a new "path" method on OutputWriter that returns the path of the file written by the OutputWriter. This is part of the necessary work to consolidate structured streaming and batch write paths.
The batch write path has a nice feature that each data source can define the extension of the files, and allow Spark to specify the staging directory and the prefix for the files. However, in the streaming path we need to collect the list of files written, and there is no interface right now to do that.
## How was this patch tested?
N/A - there is no behavior change and this should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#15580 from rxin/SPARK-18042.
## What changes were proposed in this pull request?
`Array[T]()` -> `Array.empty[T]` to avoid allocating 0-length arrays.
Use regex `find . -name '*.scala' | xargs -i bash -c 'egrep "Array\[[A-Za-z]+\]\(\)" -n {} && echo {}'` to find modification candidates.
cc srowen
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15564 from zhengruifeng/avoid_0_length_array.
## What changes were proposed in this pull request?
Currently each data source OutputWriter is responsible for specifying the entire file name for each file output. This, however, does not make any sense because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. bucket id. The current approach allows individual data sources to break the implementation of bucketing.
On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.
This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself. Note that a side effect of this change is that now all file based data sources also support bucketing automatically.
There are also some other minor cleanups:
- Removed the UUID passed through generic Configuration string
- Some minor rewrites for better clarity
- Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning
## How was this patch tested?
This should be covered by existing data source tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#15562 from rxin/SPARK-18021.
## What changes were proposed in this pull request?
The sample weight testing for logistic regressions is not robust. Logistic regression suite already has many test cases comparing results to R glmnet. Since both libraries support sample weights, we should use sample weights in the test to increase coverage for sample weighting. This patch doesn't really add any code and makes the testing more complete.
Also fixed some errors with the R code that was referenced in the test suit. Changed `standardization=T` to `standardize=T` since the former is invalid.
## How was this patch tested?
Existing unit tests are modified. No non-test code is touched.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15488 from sethah/logreg_weight_tests.
## What changes were proposed in this pull request?
For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.
So we change statistic to pValue for SelectKBest and SelectPercentile
## How was this patch tested?
change existing test
Author: Peng <peng.meng@intel.com>
Closes#15444 from mpjlu/chisqure-bug.
## What changes were proposed in this pull request?
Add BisectingKMeansSummary
## How was this patch tested?
unit test
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12394 from zhengruifeng/biKMSummary.
## What changes were proposed in this pull request?
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some difference between mllib and ml to handle labels:
* mllib allow input labels as {-1, +1}, however, ml assumes the input labels in range [0, numClasses).
* mllib ```NaiveBayesModel``` expose ```labels``` but ml did not due to the assumption mention above.
During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we use
```val labels = data.map(_.label).distinct().collect().sorted```
to get the distinct labels firstly, and then encode the labels for training. It involves extra Spark job compared with the original implementation. Since ```NaiveBayes``` only do one pass aggregation during training, adding another one seems less efficient. We can get the labels in a single pass along with ```NaiveBayes``` training and send them to MLlib side.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15402 from yanboliang/spark-17835.
## What changes were proposed in this pull request?
This is a revival of https://github.com/apache/spark/pull/14948 and related to https://github.com/apache/spark/pull/14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.
This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15342 from srowen/SPARK-11560.
## What changes were proposed in this pull request?
Fix SparkR ```spark.naiveBayes``` error when response variable of dataset is numeric type.
See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15431 from yanboliang/spark-15153-2.
## What changes were proposed in this pull request?
```RFormula``` will index label only when it is string type currently. If the label is numeric type and we use ```RFormula``` to present a classification model, there is no label attributes in label column metadata. The label attributes are useful when making prediction for classification, so we can force to index label by ```StringIndexer``` whether it is numeric or string type for classification. Then SparkR wrappers can extract label attributes from label column metadata successfully. This feature can help us to fix bug similar with [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
For regression, we will still to keep label as numeric type.
In this PR, we add a param ```indexLabel``` to control whether to force to index label for ```RFormula```.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13675 from yanboliang/spark-15957.
## What changes were proposed in this pull request?
A nonsensical split is produced from method `findSplitsForContinuousFeature` for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is `> 0` is removed, and instead features with zero possible splits are ignored.
## How was this patch tested?
A unit test was added to check that finding splits for a constant feature produces an empty array.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12374 from sethah/SPARK-14610.
## What changes were proposed in this pull request?
While adding R wrapper for LogisticRegression, I found one extra comment. It is minor and I just remove it.
## How was this patch tested?
Unit tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15391 from wangmiao1981/mlordoc.
## What changes were proposed in this pull request?
In practice we cannot guarantee that an `InternalRow` is immutable. This makes the `MutableRow` almost redundant. This PR folds `MutableRow` into `InternalRow`.
The code below illustrates the immutability issue with InternalRow:
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
val struct = new GenericMutableRow(1)
val row = InternalRow(struct, 1)
println(row)
scala> [[null], 1]
struct.setInt(0, 42)
println(row)
scala> [[42], 1]
```
This might be somewhat controversial, so feedback is appreciated.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#15333 from hvanhovell/SPARK-17761.
## What changes were proposed in this pull request?
Before, we computed `instances` in LinearRegression in two spots, even though they did the same thing. One of them did not cast the label column to `DoubleType`. This patch consolidates the computation and always casts the label column to `DoubleType`.
## How was this patch tested?
Added a unit test to check all solvers. This test failed before this patch.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15364 from sethah/linreg_numeric_type.
## What changes were proposed in this pull request?
Avoid 2D array flatten in ```NaiveBayes``` training, since flatten method might be expensive (It will create another array and copy data there).
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15359 from yanboliang/nb-theta.
## What changes were proposed in this pull request?
1,parity check and add missing test suites for ml's NB
2,remove some unused imports
## How was this patch tested?
manual tests in spark-shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15312 from zhengruifeng/nb_test_parity.
## What changes were proposed in this pull request?
When use PeriodicGraphCheckpointer to persist graph, sometimes the edges isn't persisted. As currently only when vertices's storage level is none, graph is persisted. However there is a chance vertices's storage level is not none while edges's is none. Eg. graph created by a outerJoinVertices operation, vertices is automatically cached while edges is not. In this way, edges will not be persisted if we use PeriodicGraphCheckpointer do persist. We need separately check edges's storage level and persisted it if it's none.
## How was this patch tested?
manual tests
Author: ding <ding@localhost.localdomain>
Closes#15124 from dding3/spark-persisitEdge.
## What changes were proposed in this pull request?
Partial revert of #15277 to instead sort and store input to model rather than require sorted input
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15299 from srowen/SPARK-17704.2.
## What changes were proposed in this pull request?
Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15313 from zhengruifeng/revert_save_load.
## What changes were proposed in this pull request?
1,support weighted data
2,use dataset/dataframe instead of rdd
3,make mllib as a wrapper to call ml
## How was this patch tested?
local manual tests in spark-shell
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12819 from zhengruifeng/weighted_nb.
## What changes were proposed in this pull request?
In calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate using a Dataset where the Label is not of a double type, calculations pattern match against a double and throw a MatchError. This fix casts the Label column to a DoubleType to ensure there is no MatchError.
## How was this patch tested?
Added unit tests to call evaluate with a dataset that has Label as other numeric types.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
## What changes were proposed in this pull request?
* changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
* adds a test that was failing before this change, but succeeds with these changes.
The problem in the previous implementation was that it only increments `i`, that is enumerating the columns of a row in the SparseMatrix, when the row-index of the vector matches the column-index of the SparseMatrix. In cases where a particular row of the SparseMatrix has non-zero values at column-indices lower than corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i` and the remaining column-indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.
## How was this patch tested?
I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
## ___
As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.
Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
Closes#15296 from bwahlgreen/bugfix-spark-17721.
## What changes were proposed in this pull request?
Several performance improvement for ```ChiSqSelector```:
1, Keep ```selectedFeatures``` ordered ascendent.
```ChiSqSelectorModel.transform``` need ```selectedFeatures``` ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
2, When training ```fpr``` type ```ChiSqSelectorModel```, it's not necessary to sort the ChiSq test result by statistic.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15277 from yanboliang/spark-17704.
## What changes were proposed in this pull request?
#14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue at ```VectorIndexerSuite```. If we create the DataFrame by ```Seq(...).toDF()```, it will throw different error/exception compared with ```sc.parallelize(Seq(...)).toDF()``` for one of the test cases.
After in-depth study, I found it was caused by different behavior of local and distributed Dataset if the UDF failed at ```assert```. If the data is local Dataset, it throws ```AssertionError``` directly; If the data is distributed Dataset, it throws ```SparkException``` which is the wrapper of ```AssertionError```. I think we should enforce this test to cover both case.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15261 from yanboliang/spark-16356.
## What changes were proposed in this pull request?
This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully-consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully-consumed.
This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
## How was this patch tested?
Tested manually for now.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#15245 from JoshRosen/SPARK-17666-close-recordreader.
## What changes were proposed in this pull request?
This PR introduces more compact representation for ```UnsafeArrayData```.
```UnsafeArrayData``` needs to accept ```null``` value in each entry of an array. In the current version, it has three parts
```
[numElements] [offsets] [values]
```
`Offsets` has the number of `numElements`, and represents `null` if its value is negative. It may increase memory footprint, and introduces an indirection for accessing each of `values`.
This PR uses bitvectors to represent nullability for each element like `UnsafeRow`, and eliminates an indirection for accessing each element. The new ```UnsafeArrayData``` has four parts.
```
[numElements][null bits][values or offset&length][variable length portion]
```
In the `null bits` region, we store 1 bit per element, represents whether an element is null. Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte boundaries.
In the `values or offset&length` region, we store the content of elements. For fields that hold fixed-length primitive types, such as long, double, or int, we store the value directly in the field. For fields with non-primitive or variable-length values, we store a relative offset (w.r.t. the base address of the array) that points to the beginning of the variable-length field and length (they are combined into a long). Each is word-aligned. For `variable length portion`, each is aligned to 8-byte boundaries.
The new format can reduce memory footprint and improve performance of accessing each element. An example of memory foot comparison:
1024x1024 elements integer array
Size of ```baseObject``` for ```UnsafeArrayData```: 8 + 1024x1024 + 1024x1024 = 2M bytes
Size of ```baseObject``` for ```UnsafeArrayData```: 8 + 1024x1024/8 + 1024x1024 = 1.25M bytes
In summary, we got 1.0-2.6x performance improvements over the code before applying this PR.
Here are performance results of [benchmark programs](04d2e4b6db/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala):
**Read UnsafeArrayData**: 1.7x and 1.6x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Read UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 430 / 436 390.0 2.6 1.0X
Double 456 / 485 367.8 2.7 0.9X
With SPARK-15962
Read UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 252 / 260 666.1 1.5 1.0X
Double 281 / 292 597.7 1.7 0.9X
````
**Write UnsafeArrayData**: 1.0x and 1.1x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Write UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 203 / 273 103.4 9.7 1.0X
Double 239 / 356 87.9 11.4 0.8X
With SPARK-15962
Write UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 196 / 249 107.0 9.3 1.0X
Double 227 / 367 92.3 10.8 0.9X
````
**Get primitive array from UnsafeArrayData**: 2.6x and 1.6x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Get primitive array from UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 207 / 217 304.2 3.3 1.0X
Double 257 / 363 245.2 4.1 0.8X
With SPARK-15962
Get primitive array from UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 151 / 198 415.8 2.4 1.0X
Double 214 / 394 293.6 3.4 0.7X
````
**Create UnsafeArrayData from primitive array**: 1.7x and 2.1x performance improvements over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
Create UnsafeArrayData from primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 340 / 385 185.1 5.4 1.0X
Double 479 / 705 131.3 7.6 0.7X
With SPARK-15962
Create UnsafeArrayData from primitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Int 206 / 211 306.0 3.3 1.0X
Double 232 / 406 271.6 3.7 0.9X
````
1.7x and 1.4x performance improvements in [```UDTSerializationBenchmark```](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala) over the code before applying this PR
````
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Without SPARK-15962
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
serialize 442 / 533 0.0 441927.1 1.0X
deserialize 217 / 274 0.0 217087.6 2.0X
With SPARK-15962
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
serialize 265 / 318 0.0 265138.5 1.0X
deserialize 155 / 197 0.0 154611.4 1.7X
````
## How was this patch tested?
Added unit tests into ```UnsafeArraySuite```
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#13680 from kiszk/SPARK-15962.
## What changes were proposed in this pull request?
This was suggested in 101663f1ae (commitcomment-17114968).
This PR adds `testImplicits` to `MLlibTestSparkContext` so that some implicits such as `toDF()` can be sued across ml tests.
This PR also changes all the usages of `spark.createDataFrame( ... )` to `toDF()` where applicable in ml tests in Scala.
## How was this patch tested?
Existing tests should work.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14035 from HyukjinKwon/minor-ml-test.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15214 from yanboliang/spark-17017.
## What changes were proposed in this pull request?
Match ProbabilisticClassifer.thresholds requirements to R randomForest cutoff, requiring all > 0
## How was this patch tested?
Jenkins tests plus new test cases
Author: Sean Owen <sowen@cloudera.com>
Closes#15149 from srowen/SPARK-17057.
## What changes were proposed in this pull request?
To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15179 from srowen/SPARK-10835.
## What changes were proposed in this pull request?
update `MultilayerPerceptronClassifierWrapper.fit` paramter type:
`layers: Array[Int]`
`seed: String`
update several default params in sparkR `spark.mlp`:
`tol` --> 1e-6
`stepSize` --> 0.03
`seed` --> NULL ( when seed == NULL, the scala-side wrapper regard it as a `null` value and the seed will use the default one )
r-side `seed` only support 32bit integer.
remove `layers` default value, and move it in front of those parameters with default value.
add `layers` parameter validation check.
## How was this patch tested?
tests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15051 from WeichenXu123/update_py_mlp_default.
## What changes were proposed in this pull request?
RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
This PR:
(a) Change the RF node queue to be FILO (a stack), so that RFs tend to focus on 1 or a few trees before focusing on others.
(b) Change topNodes to pass only the trees required on that iteration.
## How was this patch tested?
Unit tests:
* Existing tests for correctness of tree learning
* Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
* This last item is hard to test via unit tests given the current APIs.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14359 from jkbradley/rfs-fewer-trees.
## What changes were proposed in this pull request?
Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
## How was this patch tested?
I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x). In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15034 from jkbradley/lda-backwards.
## What changes were proposed in this pull request?
Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
## What changes were proposed in this pull request?
Update error handling for Cholesky decomposition to provide a little more info when input is singular.
## How was this patch tested?
New test case; jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15177 from srowen/SPARK-11918.
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes#14858 from VinceShieh/spark-17219.
## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?
Add Scala ut
Author: Peng, Meng <peng.meng@intel.com>
Closes#14597 from mpjlu/fprChiSquare.
## What changes were proposed in this pull request?
The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.
## How was this patch tested?
This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:
```
object W2VTiming {
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.feature.Word2VecModel
def run(modelPath: String, scOpt: Option[SparkContext] = None) {
val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
val model = Word2VecModel.load(sc, modelPath)
val keys = model.getVectors.keys
val start = System.currentTimeMillis
for(key <- keys) {
model.findSynonyms(key, 5)
model.findSynonyms(key, 10)
model.findSynonyms(key, 25)
model.findSynonyms(key, 50)
}
val finish = System.currentTimeMillis
println("run completed in " + (finish - start) + "ms")
}
}
```
I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
Author: William Benton <willb@redhat.com>
Closes#15150 from willb/SPARK-17595.
## What changes were proposed in this pull request?
Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.
Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
## How was this patch tested?
Merged test suites and added some new unit tests.
## Design
### Switching between binomial and multinomial
We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
### coefficients/intercept model API (TODO)
This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
* We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
* There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.
Some options:
1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.
If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
**Update:** We have chosen option 2
### Threshold/thresholds (TODO)
Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
**Update:** Let's leave it for a follow up PR
## Follow up
* Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
Author: sethah <seth.hendrickson16@gmail.com>
Closes#14834 from sethah/SPARK-17163.
## What changes were proposed in this pull request?
This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
## How was this patch tested?
I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
Author: William Benton <willb@redhat.com>
Closes#15105 from willb/fix/findSynonyms.
## What changes were proposed in this pull request?
as the TODO described,
check weight vector size and if wrong throw exception.
## How was this patch tested?
existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15060 from WeichenXu123/check_input_weight_size_of_ann.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15050 from yanboliang/spark-17389.
## What changes were proposed in this pull request?
Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also https://github.com/apache/spark/pull/14948
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#14956 from srowen/SPARK-17389.2.
## What changes were proposed in this pull request?
#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved:
1, It’s not necessary to check and rename label column.
Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception.
2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```.
3, We should set correct new features column for the estimators. Take ```GLM``` as example:
```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14993 from yanboliang/spark-15509.
## What changes were proposed in this pull request?
We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14914 from lw-lin/append_to_plus_eq_v2.
## What changes were proposed in this pull request?
1, remove unnecessary `require()`, because it will make following check useless.
2, update the error msg.
## How was this patch tested?
no test
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14972 from zhengruifeng/del_unnecessary_check.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.
## How was this patch tested?
Doc change.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14967 from yanboliang/mlp-weights.
## What changes were proposed in this pull request?
If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. An example
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
...
```
We should catch these exceptions and rethrow them with better error message, to say that the exception is happened in scala udf.
This PR also does some clean up for `ScalaUDF` and add a unit test suite for it.
## How was this patch tested?
the new test suite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14850 from cloud-fan/npe.
## What changes were proposed in this pull request?
Since we have updated breeze version to 0.12, we should remove work around for bug of breeze sparse matrix in v0.11.
I checked all mllib code and found this is the only work around for breeze 0.11.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14953 from yanboliang/matrices.
## What changes were proposed in this pull request?
Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change.
- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14826 from srowen/SPARK-16832.2.
## What changes were proposed in this pull request?
Improved the code quality of spark by replacing all pattern match on boolean value by if/else block.
## How was this patch tested?
By running the tests
Author: Shivansh <shiv4nsh@gmail.com>
Closes#14873 from shiv4nsh/SPARK-17308.
## What changes were proposed in this pull request?
This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution.
## How was this patch tested?
R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14881 from junyangq/SPARK-17315.
## What changes were proposed in this pull request?
fix `MultivariantOnlineSummerizer.numNonZeros` method,
return `nnz` array, instead of `weightSum` array
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14923 from WeichenXu123/fix_MultivariantOnlineSummerizer_numNonZeros.
https://issues.apache.org/jira/browse/SPARK-15509
## What changes were proposed in this pull request?
Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
`training <- loadDF(sqlContext, ".../mnist", "libsvm")`
`model <- naiveBayes(label ~ features, training)`
This fails with:
```
16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: Output column features already exists.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
The same issue appears for the "label" column once you rename the "features" column.
```
The cause is, when using `loadDF()` to generate dataframes, sometimes it’s with default column name `“label”` and `“features”`, and these two name will conflict with default column names `setDefault(labelCol, "label")` and ` setDefault(featuresCol, "features")` of `SharedParams.scala`
## How was this patch tested?
Test on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#13584 from keypointt/SPARK-15509.
## What changes were proposed in this pull request?
Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14895 from srowen/SPARK-17331.
https://issues.apache.org/jira/browse/SPARK-17241
## What changes were proposed in this pull request?
Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression.
## How was this patch tested?
Test manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14856 from keypointt/SPARK-17241.
## What changes were proposed in this pull request?
Clean up unused variables and unused import statements, unnecessary `return` and `toArray`, and some more style improvement, when I walk through the code examples.
## How was this patch tested?
Testet manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14836 from keypointt/codeWalkThroughML.
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
## How was this patch tested?
Jenkins tests, including new caes to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes#14663 from srowen/SPARK-17001.
## What changes were proposed in this pull request?
The require condition and message doesn't match, and the condition also should be optimized.
Small change. Please kindly let me know if JIRA required.
## How was this patch tested?
No additional test required.
Author: Peng, Meng <peng.meng@intel.com>
Closes#14824 from mpjlu/smallChangeForMatrixRequire.
## What changes were proposed in this pull request?
fix comparing Vector bug in TestingUtils.
There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Peng, Meng <peng.meng@intel.com>
Closes#14785 from mpjlu/testUtils.
https://issues.apache.org/jira/browse/SPARK-16445
## What changes were proposed in this pull request?
Create Multilayer Perceptron Classifier wrapper in SparkR
## How was this patch tested?
Tested manually on local machine
Author: Xin Ren <iamshrek@126.com>
Closes#14447 from keypointt/SPARK-16445.
## What changes were proposed in this pull request?
In cases when QuantileDiscretizerSuite is called upon a numeric array with duplicated elements, we will take the unique elements generated from approxQuantiles as input for Bucketizer.
## How was this patch tested?
An unit test is added in QuantileDiscretizerSuite
QuantileDiscretizer.fit will throw an illegal exception when calling setSplits on a list of splits
with duplicated elements. Bucketizer.setSplits should only accept either a numeric vector of two
or more unique cut points, although that may produce less number of buckets than requested.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes#14747 from VinceShieh/SPARK-17086.
## What changes were proposed in this pull request?
Fix a typo
## How was this patch tested?
no tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14772 from zhengruifeng/minor_numClasses.
## What changes were proposed in this pull request?
In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output.
Author: Jagadeesan <as2@us.ibm.com>
Closes#14688 from jagadeesanas2/SPARK-17095.
## What changes were proposed in this pull request?
Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added.
Author: hqzizania <hqzizania@gmail.com>
Closes#14738 from hqzizania/SPARK-17090-minor.
## What changes were proposed in this pull request?
Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc.
## How was this patch tested?
Extended doctests
Author: Holden Karau <holden@us.ibm.com>
Closes#12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
## What changes were proposed in this pull request?
Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties.
This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place.
changes overview:
1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`)
**after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore.
2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties.
**after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it.
bonus: now we can create data source table using `SessionCatalog`, if schema is specified.
breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14155 from cloud-fan/catalog-table.
## What changes were proposed in this pull request?
Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml
Author: hqzizania <hqzizania@gmail.com>
Closes#14717 from hqzizania/SPARK-17090.
## What changes were proposed in this pull request?
In the existing code, ```MinMaxScaler``` handle ```NaN``` value indeterminately.
* If a column has identity value, that is ```max == min```, ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even the original value is ```NaN```.
* Otherwise, it will remain ```NaN``` after transformation.
I think we should unify the behavior by remaining ```NaN``` value at any condition, since we don't know how to transform a ```NaN``` value. In Python sklearn, it will throw exception when there is ```NaN``` in the dataset.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14716 from yanboliang/spark-17141.
## What changes were proposed in this pull request?
This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML.
JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159)
## How was this patch tested?
Added new test suite `MultinomialLogisticRegressionSuite`.
## Approach
### Do not use a "pivot" class in the algorithm formulation
Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilites using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai.
### Separate multinomial logistic regression and binary logistic regression
The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.
**Arguments for:**
* The multinomial case without pivot is distinctly different than the current binary case since the binary case uses a pivot class.
* The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant.
**Arguments against:**
* It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two class outcomes).
* Some portions of the code are repeated.
This is a major design point and warrants more discussion.
### Mean centering
When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well.
### Feature scaling
In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimizaton. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this has significant (4x in some cases) slow downs in each iteration. This can be avoided by transforming the input to the scaled space ala MLlib once, before iteration begins. This does add some overhead initially, but can make significant time savings in some cases.
One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion.
### Specifying and inferring the number of outcome classes
The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm.
## Performance
Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.
Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each
Notes:
* Time in units of seconds
* Metric is classification accuracy
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.746415 | 30 | 100000 | 3 | 100000 | 327.923 | true | 0 |
| mllib | 0 | true | 0.743785 | 30 | 100000 | 3 | 100000 | 390.217 | true | 0 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.973238 | 30 | 2000000 | 3 | 10000 | 385.476 | true | 0 |
| mllib | 0 | true | 0.949828 | 30 | 2000000 | 3 | 10000 | 550.403 | true | 0 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| mllib | 0 | true | 0.864358 | 30 | 2000000 | 3 | 10000 | 543.359 | true | 0.1 |
| ml | 0 | true | 0.867418 | 30 | 2000000 | 3 | 10000 | 401.955 | true | 0.1 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 1 | true | 0.807449 | 30 | 2000000 | 3 | 10000 | 334.892 | true | 0.05 |
| algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam |
|--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
| ml | 0 | true | 0.602006 | 30 | 2000000 | 500 | 100 | 112.319 | true | 0 |
| mllib | 0 | true | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 |e | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 |
## References
Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf)
[http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
## Follow up items
* Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134)
* Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140)
* Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138)
* Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090)
* Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135)
* Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136)
* Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137)
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13796 from sethah/SPARK-7159_M.
## What changes were proposed in this pull request?
Add LDA Wrapper in SparkR with the following interfaces:
- spark.lda(data, ...)
- spark.posterior(object, newData, ...)
- spark.perplexity(object, ...)
- summary(object)
- write.ml(object)
- read.ml(path)
## How was this patch tested?
Test with SparkR unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14229 from yinxusen/SPARK-16447.
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14392 from yanboliang/spark-16446.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Fix ```LogisticRegression``` typo in error message.
## How was this patch tested?
Docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14633 from yanboliang/lr-typo.
## What changes were proposed in this pull request?
Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient
## How was this patch tested?
Spark unit tests
Author: zero323 <zero323@users.noreply.github.com>
Closes#14614 from zero323/SPARK-17027.
## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance.
BTW, we should destroy broadcast variable ```compute``` at the end of each iteration.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14621 from yanboliang/spark-17033.
## What changes were proposed in this pull request?
Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes#14407 from holdenk/SPARK-16779.
## What changes were proposed in this pull request?
This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review.
The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process.
## How was this patch tested?
Modified unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14455 from yanboliang/transformSchema.
## What changes were proposed in this pull request?
Add threshoulds' length checking for Classifiers which extends ProbabilisticClassifier
## How was this patch tested?
unit tests and manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14470 from zhengruifeng/classifier_check_setThreshoulds_length.
## What changes were proposed in this pull request?
To Make sure ANN layer input training data to be persisted,
so that it can avoid overhead cost if the RDD need to be computed from lineage.
## How was this patch tested?
Existing Tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14483 from WeichenXu123/add_ann_persist_training_data.
## What changes were proposed in this pull request?
Add a length checking for threshoulds' length in method `setThreshoulds()` of classification models.
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14457 from zhengruifeng/check_setThresholds.
## What changes were proposed in this pull request?
Removed useless latex in a log messge.
## How was this patch tested?
Check generated scaladoc.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes#14380 from lins05/fix-docs-formatting.
## What changes were proposed in this pull request?
update unused broadcast in KMeans/Word2Vec,
use destroy(false) to release memory in time.
and several place destroy() update to destroy(false) so that it will be async-called,
it will better than blocking called.
and update bcNewCenters in KMeans to make it destroy in correct time.
I use a list to store all historical `bcNewCenters` generated in each loop iteration and delay them to release at the end of loop.
fix TODO in `BisectingKMeans.run` "unpersist old indices",
Implements the pattern "persist current step RDD, and unpersist previous one" in the loop iteration.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14333 from WeichenXu123/broadvar_unpersist_to_destroy.
## What changes were proposed in this pull request?
Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#14332 from srowen/SPARK-16694.
## What changes were proposed in this pull request?
ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake.
See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
Why the unit tests did not complain this errors? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` firstly during ```fit``` or ```transform```. I will also add this function to all estimators/transformers who missed in this PR.
## How was this patch tested?
No new tests, should pass existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14378 from yanboliang/spark-16750.
## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.
## How was this patch tested?
Documentation update
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#13894 from krishnakalyan3/kfold-cv.
## What changes were proposed in this pull request?
Fix some mistake in ```LinearRegression``` formula.
## How was this patch tested?
Documents change, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14369 from yanboliang/LiR-formula.
## What changes were proposed in this pull request?
In `LDAOptimizer.submitMiniBatch`, do persist on `stats: RDD[(BDM[Double], List[BDV[Double]])]`
and also move the place of unpersisting `expElogbetaBc` broadcast variable,
to avoid the `expElogbetaBc` broadcast variable to be unpersisted too early,
and update previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)`
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14335 from WeichenXu123/improve_LDA.
## What changes were proposed in this pull request?
replace ANN convergence tolerance param default
from 1e-4 to 1e-6
so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.
## How was this patch tested?
Existing Test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14286 from WeichenXu123/update_ann_tol.
## What changes were proposed in this pull request?
renaming var names to make code more clear:
nnz => weightSum
weightSum => totalWeightSum
and add a new member vector `nnz` (not `nnz` in previous code, which renamed to `weightSum`) to count each dimensions non-zero value number.
using `nnz` which I added above instead of `weightSum` when calculating min/max so that it fix several numerical error in some extreme case.
## How was this patch tested?
A new testcase added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14216 from WeichenXu123/multivarOnlineSummary.
## What changes were proposed in this pull request?
Forgotten broadcasted variables were persisted into a previous #PR 14153). This PR turns those `unpersist()` into `destroy()` so that memory is freed even on the driver.
## How was this patch tested?
Unit Tests in Word2VecSuite were run locally.
This contribution is done on behalf of Criteo, according to the
terms of the Apache license 2.0.
Author: Anthony Truchet <a.truchet@criteo.com>
Closes#14268 from AnthonyTruchet/SPARK-16440.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
`\partial\x` ==> `\partial x`
`har{x_i}` ==> `hat{x_i}`
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14246 from WeichenXu123/fix_formular_err.
https://issues.apache.org/jira/browse/SPARK-16535
## What changes were proposed in this pull request?
When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).
ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
## How was this patch tested?
I've tested by re-building the project, and build succeeded.
Author: Xin Ren <iamshrek@126.com>
Closes#14189 from keypointt/SPARK-16535.
## What changes were proposed in this pull request?
fininsh => finish
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14238 from WeichenXu123/fix_fininsh_typo.
## What changes were proposed in this pull request?
These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.
## How was this patch tested?
Jenkins build of docs
Author: Sean Owen <sowen@cloudera.com>
Closes#14221 from srowen/SPARK-3359.3.
## What changes were proposed in this pull request?
Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition.
## How was this patch tested?
Added a unit test.
Author: z001qdp <Nicholas.Eggert@target.com>
Closes#14140 from neggert/SPARK-16426-isotonic-nan.
## What changes were proposed in this pull request?
Add warning_for the following case when LBFGS training not actually convergence:
1) LogisticRegression
2) AFTSurvivalRegression
3) LBFGS algorithm wrapper in mllib package
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
## What changes were proposed in this pull request?
Fixing issues found during 2.0 API checks:
* GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed
* sqlDataTypes: name does not follow conventions. Do we need to expose it?
* Evaluator: inconsistent doc between evaluate and isLargerBetter
* MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little
* GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name
## How was this patch tested?
Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14187 from jkbradley/final-api-check-2.0.
## What changes were proposed in this pull request?
General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.
Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier
DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
## How was this patch tested?
N/A
Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14147 from jkbradley/experimental-audit.
## What changes were proposed in this pull request?
We have a use case of multiplying very big sparse matrices. we have about 1000x1000 distributed block matrices multiplication and the simulate multiply goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it slightly with classical hashmap and now run in about 30 seconds O(n^2).
## How was this patch tested?
We have added a performance test and verified the reduced time.
Author: oraviv <oraviv@paypal.com>
Closes#14068 from uzadude/master.
## What changes were proposed in this pull request?
Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14153 from srowen/SPARK-16440.
## What changes were proposed in this pull request?
In `ml.regression.LinearRegression`, it use breeze `LBFGS` and `OWLQN` optimizer to do data training, but do not check whether breeze's optimizer returned result actually reached convergence.
The `LBFGS` and `OWLQN` optimizer in breeze finish iteration may result the following situations:
1) reach max iteration number
2) function reach value convergence
3) objective function stop improving
4) gradient reach convergence
5) search failed(due to some internal numerical error)
I add warning printing code so that
if the iteration result is (1) or (3) or (5) in above, it will print a warning with respective reason string.
## How was this patch tested?
Manual.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14122 from WeichenXu123/add_lr_not_convergence_warn.
## What changes were proposed in this pull request?
In `train` method of `ml.regression.LinearRegression` when handling situation `std(label) == 0`
the code replace `std(label)` with `mean(label)` but the relative comment is inconsistent, I update it.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#14121 from WeichenXu123/update_lr_comment.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
The following Java code because of type erasing:
```Java
JavaRDD<Vector> rows = jsc.parallelize(...);
RowMatrix mat = new RowMatrix(rows.rdd());
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
```
We should use retag to restore the type to prevent the following exception:
```Java
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
```
## How was this patch tested?
Java unit test
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14051 from yinxusen/SPARK-16372.
## What changes were proposed in this pull request?
"test big model load / save" in Word2VecSuite, lately resulted into OOM.
Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test.
## How was this patch tested?
It was tested running the following unit test:
org.apache.spark.mllib.feature.Word2VecSuite
Author: tmnd1991 <antonio.murgia2@studio.unibo.it>
Closes#13509 from tmnd1991/SPARK-15740.
## What changes were proposed in this pull request?
The current tests assumes that `impurity.calculate()` returns the variance correctly. It should be better to make the tests independent of this assumption. In other words verify that the variance computed equals the variance computed manually on a small tree.
## How was this patch tested?
The patch is a test....
Author: MechCoder <mks542@nyu.edu>
Closes#13981 from MechCoder/dt_variance.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16249
Change visibility of Object ml.clustering.LDA to public for loading, thus users can invoke LDA.load("path").
## How was this patch tested?
existing ut and manually test for load ( saved with current code)
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13941 from hhbyyh/ldapublic.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14608
PipelineStage.transformSchema currently has minimal documentation. It should have more to explain it can:
check schema
check parameter interactions
## How was this patch tested?
unit test
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#12384 from hhbyyh/transformSchemaDoc.
## What changes were proposed in this pull request?
model loading backward compatibility for ml NaiveBayes
## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.
Author: zlpmichelle <zlpmichelle@gmail.com>
Closes#13940 from zlpmichelle/naivebayes.
## What changes were proposed in this pull request?
What changes were proposed in this pull request?
Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees
## How was this patch tested?
the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail.
**PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0
Author: Mahmoud Rawas <mhmoudr@gmail.com>
Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au>
Closes#13624 from mhmoudr/SPARK-15858.master.
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature.PCA.
## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13937 from yanboliang/spark-16245.
## What changes were proposed in this pull request?
This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame.
## How was this patch tested?
Doctest in python.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13935 from yanboliang/spark-16242.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16187
This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually.
## How was this patch tested?
java and scala ut
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13888 from hhbyyh/matComp.
## What changes were proposed in this pull request?
Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66.
## How was this patch tested?
Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results.
line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1
crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length.
To fix this just make trueWeights be the same length as x.
I have recompiled the project with the change and it is working now:
[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test
And it generates the data successfully now in the specified folder.
Author: José Antonio <joseanmunoz@gmail.com>
Closes#13895 from j4munoz/patch-2.
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature,
## How was this patch tested?
existing ut and manual test for loading 1.6 models.
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13844 from hhbyyh/featureComp.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16177
model loading backward compatibility for ml.regression
## How was this patch tested?
existing ut and manual test for loading 1.6 models.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13879 from hhbyyh/regreComp.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16130
model loading backward compatibility for ml.classfication.LogisticRegression
## How was this patch tested?
existing ut and manual test for loading old models.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#13841 from hhbyyh/lrcomp.
## What changes were proposed in this pull request?
Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.
## How was this patch tested?
Manually checked generated APIs.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13859 from mengxr/SPARK-16154.
## What changes were proposed in this pull request?
We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
~~~scala
/** group setParam */
Since("1.6.0")
deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
def setLabelCol(value: String): this.type = set(labelCol, value)
~~~
This unfortunately hit a genjavadoc bug and broken doc generation. This is the generated Java code:
~~~java
/** group setParam */
public org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value) { throw new RuntimeException(); }
*
* deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
*/
public org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value) { throw new RuntimeException(); }
~~~
Switching to multiline is a workaround.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13855 from mengxr/SPARK-16153.
## What changes were proposed in this pull request?
`DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use it. So this PR changes the annotation to `DeveloperApi`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13828 from mengxr/default-readable-should-be-developer-api.
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.
## How was this patch tested?
Existing unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13840 from MLnick/SPARK-16127-ml-linalg-since.
## What changes were proposed in this pull request?
Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.
## How was this patch tested?
Built docs locally & PySpark SQL tests
Author: Holden Karau <holden@us.ibm.com>
Closes#12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
Also fix a test case issue in `BroadcastJoinSuite`.
BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13380 from gatorsmile/sqlContextML.
## What changes were proposed in this pull request?
Deprecate `labelCol`, which is not used by ChiSqSelectorModel.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
## What changes were proposed in this pull request?
We forgot the getter of `dropLast` in `OneHotEncoder`
## How was this patch tested?
unit test
Author: Xiangrui Meng <meng@databricks.com>
Closes#13821 from mengxr/SPARK-16118.
## What changes were proposed in this pull request?
LibSVMFileFormat implements data source for LIBSVM format. However, users do not really need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal it to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124.
## How was this patch tested?
Manually checked the generated API doc and tested loading LIBSVM data.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13819 from mengxr/SPARK-16117.
## What changes were proposed in this pull request?
The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13813 from mengxr/checkpoint-non-expert.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13801 from mengxr/SPARK-15177.1.
This PR adds missing `Since` annotations to `ml.feature` package.
Closes#8505.
## How was this patch tested?
Existing tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13641 from MLnick/add-since-annotations.
## What changes were proposed in this pull request?
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
## How was this patch tested?
Unit tests in Scala and Java.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13789 from mengxr/SPARK-16074.
## What changes were proposed in this pull request?
This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.
## How was this patch tested?
doctest in Python
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#13731 from mengxr/SPARK-15946.
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)
## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x (for multiclass logistic regression, this number will go up) larger than it should be (in MLlib, for instance, it is 3x smaller).
This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization.
## How was this patch tested?
I tested this locally and verified the serialization reduction.
![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)
Additionally, I ran some tests of a 4 node cluster (4x48 cores, 4x128 GB RAM). Data set size of 2M rows and 10k features showed >2x iteration speedup.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#13729 from sethah/lr_improvement.
## What changes were proposed in this pull request?
SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.
**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```
**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```
## How was this patch tested?
Pass the Jenkins tests (including the above case)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13643 from dongjoon-hyun/SPARK-15922.
## What changes were proposed in this pull request?
Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.
However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13698 from liancheng/remove-prepare-read.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
## How was this patch tested?
Renamed test cases as well.
Author: Reynold Xin <rxin@databricks.com>
Closes#13696 from rxin/parquet-rename.
The PR changes outdated scaladocs for Gini and Entropy classes. Since PR #886 Spark supports multiclass classification, but the docs tell only about binary classification.
Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
Closes#11252 from wjur/wjur/docs_multiclass.
## What changes were proposed in this pull request?
This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens.
This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.
## How was this patch tested?
Unit tests in Scala and Java.
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#13662 from mengxr/SPARK-15945.
## What changes were proposed in this pull request?
Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13219 from viirya/pyspark-pickler-ml.
## What changes were proposed in this pull request?
Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:
```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```
Please see [AFTSurvivalRegression.scala#L573-L575](6ecedf39b4/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L573-L575)) as well.
Just to be clear, the python example `aft_survival_regression.py` seems using 5 rows. So, if there exist partitions more than 5, it throws the exception above since it contains empty partitions which results in an incorrectly merged `AFTAggregator`.
Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with CPUs more than 5 is being failed because it creates tasks with some empty partitions with defualt configurations (AFAIK, it sets the parallelism level to the number of CPU cores).
## How was this patch tested?
An unit test in `AFTSurvivalRegressionSuite.scala` and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#13619 from HyukjinKwon/SPARK-15892.
## What changes were proposed in this pull request?
Currently, we always split the files when it's bigger than maxSplitBytes, but Hadoop LineRecordReader does not respect the splits for compressed files correctly, we should have a API for FileFormat to check whether the file could be splitted or not.
This PR is based on #13442, closes#13442
## How was this patch tested?
add regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13531 from davies/fix_split.
## What changes were proposed in this pull request?
In scala, immutable.List.length is an expensive operation so we should
avoid using Seq.length == 0 or Seq.lenth > 0, and use Seq.isEmpty and Seq.nonEmpty instead.
## How was this patch tested?
existing tests
Author: wangyang <wangyang@haizhi.com>
Closes#13601 from yangw1234/isEmpty.
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.
## How was this patch tested?
run pyspark-ml tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15793
Word2vec in ML package should have maxSentenceLength method for feature parity.
## How was this patch tested?
Tested with Spark unit test.
Author: yinxusen <yinxusen@gmail.com>
Closes#13536 from yinxusen/SPARK-15793.
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM.
When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg.
We should output a warning message and clarify in document for this condition.
## How was this patch tested?
Document change, no unit test.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12731 from yanboliang/spark-13590.
## What changes were proposed in this pull request?
Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.
## How was this patch tested?
Wrote example making use of the now-public APIs. Compiled and ran locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13461 from jkbradley/defaultparamswritable.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.
This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13491 from JoshRosen/foldleft-to-flatmap.
## What changes were proposed in this pull request?
1, remove comments `:: Experimental ::` for non-experimental API
2, add comments `:: Experimental ::` for experimental API
3, add comments `:: DeveloperApi ::` for developerApi API
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13514 from zhengruifeng/del_experimental.
## What changes were proposed in this pull request?
1, del precision,recall in `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`
## How was this patch tested?
local build
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes#13390 from zhengruifeng/clarify_f1.
## What changes were proposed in this pull request?
Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.
1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#13269 from cloud-fan/clean-encoder.
## What changes were proposed in this pull request?
andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
This PR disables the test. I will leave the JIRA open for a proper fix
## How was this patch tested?
No new features.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13478 from mengxr/SPARK-15740.
## What changes were proposed in this pull request?
ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
## How was this patch tested?
existing ut
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13411 from hhbyyh/schemaCheck.
Clean up style for param setter methods in ALS to match standard style and the other setter in class (this is an artefact of one of my previous PRs that wasn't cleaned up).
## How was this patch tested?
Existing tests - no functionality change.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13480 from MLnick/als-param-minor-cleanup.
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.
## How was this patch tested?
Exist tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13410 from yanboliang/spark-15587.
## What changes were proposed in this pull request?
if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when removing CheckpointFile in MLlib.
So we should always get the FileSystem from Path to avoid wrong FS problem.
## How was this patch tested?
N/A
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13408 from lianhuiwang/SPARK-15664.
## What changes were proposed in this pull request?
This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13365 from dongjoon-hyun/SPARK-15618.
## What changes were proposed in this pull request?
This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#13377 from srowen/BuildWarnings.
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...` -> `require(k < sources.first().size`
BTW, remove unused import in `ml.ElementwiseProduct`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13356 from zhengruifeng/fix_pca.
## What changes were proposed in this pull request?
We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#13198 from dbtsai/minor.
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference.
## How was this patch tested?
Document update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13262 from yanboliang/spark-15484.
## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13349 from dongjoon-hyun/SPARK-15584.
## What changes were proposed in this pull request?
This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures.
- ReadWrite.scala
- TreeModels.scala
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13352 from dongjoon-hyun/SPARK-15603.
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#13310 from yhuai/SPARK-15532.
## What changes were proposed in this pull request?
Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
* Use in PythonMLlibAPI: Change to using private constructors
* Streaming algs: No warnings after we un-deprecate the classes
* Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field
## How was this patch tested?
Existing tests. Checked for warnings manually.
Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13314 from jkbradley/warning-cleanups.
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-15523
This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
## How was this patch tested?
1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
Author: Villu Ruusmann <villu.ruusmann@gmail.com>
Closes#13297 from vruusmann/update-jpmml.
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes#13311 from rxin/SPARK-15543.
## What changes were proposed in this pull request?
Add a warning log for the case that `numIterations * miniBatchFraction <1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur each sampling approximately 6 examples each. In the best case, each of the 6 examples are unique; hence 18/100 examples are used.
This may be counter-intuitive to most users and led to the issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the training data set, it would be easier and more intuitive to use `RDD.sample`.
## How was this patch tested?
`build/mvn -DskipTests clean package` build succeeds
Author: Gio Borje <gborje@linkedin.com>
Closes#13265 from Hydrotoast/master.
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
Tests N/A.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
fixed typos for source code for components [mllib] [streaming] and [SQL]
None and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes#13298 from lfzCarlosC/master.
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.
Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).
Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.
## How was this patch tested?
A little doctest and built API docs locally to check HTML doc generation.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13228 from MLnick/SPARK-15442-py-relerror-param.
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set rather than empty. This will consistent with other similar params such as ```weightCol```
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13129 from yanboliang/spark-15339.
## What changes were proposed in this pull request?
Currently SparkSession.Builder use SQLContext.getOrCreate. It should probably the the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes#13200 from rxin/SPARK-15075.
## What changes were proposed in this pull request?
Refactor All Java Tests that use SparkSession, to extend SharedSparkSesion
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13101 from techaddict/SPARK-15296.
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.
## How was this patch tested?
Only API docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13195 from yanboliang/evaluation-doc.
## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it.
We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW.
## How was this patch tested?
Documentation update, no unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13131 from yanboliang/spark-15341.
## What changes were proposed in this pull request?
Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg):
`Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML`
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13202 from techaddict/SPARK-15414.
## What changes were proposed in this pull request?
Audit Scala API for ml.clustering.
Fix some wrong API documentations and update outdated one.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13148 from yanboliang/spark-15361.
## What changes were proposed in this pull request?
Add since to ml.stat.MultivariateOnlineSummarizer.scala
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#13197 from dbtsai/cleanup.
## What changes were proposed in this pull request?
Audit Scala API for classification, almost all issues were related ```MultilayerPerceptronClassifier``` in this section.
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter function for ```solver``` and ```stepSize```.
* Make ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13076 from yanboliang/spark-15292.
## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.
## How was this patch tested?
Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13149 from yanboliang/spark-15362.
## What changes were proposed in this pull request?
I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.
## How was this patch tested?
Built docs locally, ran style checks
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13159 from BryanCutler/ml.feature-api-sync.
## What changes were proposed in this pull request?
This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.
## How was this patch tested?
new tests in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13008 from cloud-fan/row-encoder.
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.
With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).
## How was this patch tested?
New test cases in `ALSSuite`.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#12762 from MLnick/SPARK-14891-als-validate-schema.
mateiz srowen
I state that the contribution is my original work and that I license the work to the project under the project's open source license
There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here.
The related JIRA issue though marked as resolved, this change may relate to it I think.
## Proposed Change
After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones.
Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them.
## Test result
One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706)
I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans|| with a large one with 240k points.
The data has 4096 features and I tunes K from 100 to 500.
The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below
### Local KMeans++ test:
Dataset:4m_ini_center
Data_size:16234
Dimension:4096
Lloyd's Iteration = 10
The y-axis is time in sec, the x-axis is tuning the K.
![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)
![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)
### On a larger dataset
An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results.
Dataset:4k24
Data_size:243960
Dimension:4096
Normlize Enlarge Initialize Lloyd's_Iteration
NO 1 3 5
YES 10000 3 5
Notice: the normlized data is enlarged to ensure precision
The cost time: x-for value of K, y-for time in sec
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)
SE for unnormalized data between two version, to ensure the correctness
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)
Here is the SE between normalized data just for reference, it's also correct.
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)
Author: DLucky <mouendless@gmail.com>
Closes#13133 from mouendless/patch-2.
## What changes were proposed in this pull request?
I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)
## How was this patch tested?
Exisiting unit tests
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13112 from WeichenXu123/update_accuV2_in_mllib.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13125 from dongjoon-hyun/minor_doc_sparksession.
## What changes were proposed in this pull request?
(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)
Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#13074 from srowen/SPARK-15290.
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13078 from zhengruifeng/fix_ann.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Throw better exception when numClasses is empty and empty.max is thrown.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add a new unit test, which calls histogram with empty numClasses.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12969 from wangmiao1981/logisticR.
## What changes were proposed in this pull request?
Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).
It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and
This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.
Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.
## How was this patch tested?
Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13048 from HyukjinKwon/SPARK-15267.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Scala Unit tests in ml.evaluation
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12882 from wangmiao1981/accuracy.
## What changes were proposed in this pull request?
Made ChiSqSelector and RFormula accept all numeric types for label
## How was this patch tested?
Unit tests
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#12467 from BenFradet/SPARK-13961.
## What changes were proposed in this pull request?
This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.
## How was this patch tested?
Added a unit test to `pyspark.ml.tests`
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12961 from sethah/GLR_summary.
## What changes were proposed in this pull request?
Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
We have a private `UDTRegistration` API to register user defined type. Currently `JavaTypeInference` can't work with it. So `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class.
## How was this patch tested?
`VectorUDTSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13046 from viirya/fix-udt-registry-javatypeinference.
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites
as this PR already very big working Python TestSuites in a diff PR.
## How was this patch tested?
Existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12907 from techaddict/SPARK-15037.
## What changes were proposed in this pull request?
Explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression
## How was this patch tested?
local build
Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
Closes#12948 from dding3/master.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14814
fix a java compatibility function in mllib DecisionTreeModel. As synced in jira, other compatibility issues don't need fixes.
## How was this patch tested?
existing ut
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12971 from hhbyyh/javacompatibility.
## What changes were proposed in this pull request?
We need to use `requiredSchema` in `LibSVMRelation` to project the fetch required columns when loading data from this data source. Otherwise, when users try to select `features` column, it will cause failure.
## How was this patch tested?
`LibSVMRelationSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12986 from viirya/fix-libsvmrelation.
## What changes were proposed in this pull request?
This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* covert stopwords to list in Python
* update some tests and doc
## How was this patch tested?
Unit tests.
Closes#11871
cc: burakkose srowen
Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>
Closes#12843 from mengxr/SPARK-14050.
## What changes were proposed in this pull request?
Minor doc and code style fixes
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12928 from jaceklaskowski/SPARK-15152.
## What changes were proposed in this pull request?
Copy the package documentation from Scala/Java to Python for ML package and remove beta tags. Not super sure if we want to keep the BETA tag but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).
## How was this patch tested?
Python documentation built locally as HTML and text and verified output.
Author: Holden Karau <holden@us.ibm.com>
Closes#12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
## What changes were proposed in this pull request?
Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in ML library.
## How was this patch tested?
By running KMeansSuite.
Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>
Closes#12609 from dominik-jastrzebski/master.
## What changes were proposed in this pull request?
Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
This PR brings two benefits:
1. Apparently, it de-duplicates partition value appending logic
2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.
Because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#12866 from liancheng/spark-14237-simplify-partition-values-appending.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14973
Add seed support when saving/loading of CrossValidator and TrainValidationSplit.
## How was this patch tested?
Spark unit test.
Author: yinxusen <yinxusen@gmail.com>
Closes#12825 from yinxusen/SPARK-14973.
## What changes were proposed in this pull request?
When ALS is run with a checkpoint interval, during the checkpoint materialize the current state and cleanup the previous shuffles (non-blocking).
## How was this patch tested?
Existing ALS unit tests, new ALS checkpoint cleanup unit tests added & shuffle files checked after ALS w/checkpointing run.
Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes#11919 from holdenk/SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS.
## What changes were proposed in this pull request?
This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
* This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
Defaults changed:
* Scala
* LogisticRegression: weightCol defaults to not set (instead of empty string)
* StringIndexer: labels default to not set (instead of empty array)
* GeneralizedLinearRegression:
* maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
* weightCol defaults to not set (instead of empty string)
* LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
* MultilayerPerceptron: layers default to not set (instead of [1,1])
* ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
## How was this patch tested?
Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>
Closes#12816 from jkbradley/yinxusen-SPARK-14931.
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12813 from yanboliang/spark-15030.
#### What changes were proposed in this pull request?
This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`
The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.
#### How was this patch tested?
Compilation succeded.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12732 from hvanhovell/SPARK-14952.
## What changes were proposed in this pull request?
This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.
## How was this patch tested?
Copied existing unit tests over.
cc; dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#12802 from mengxr/SPARK-14653.
## What changes were proposed in this pull request?
This PR fixes the bug that generates infinite distances between word vectors. For example,
Before this PR, we have
```
val synonyms = model.findSynonyms("who", 40)
```
will give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))
scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))
scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```
### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by number of iteration, to be consistent with Google Word2Vec implementation
## How was this patch tested?
Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.
Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>
Closes#11812 from flyjy/TVec.
## What changes were proposed in this pull request?
As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel
The argument name in `ALS.train` will be addressed in SPARK-15027.
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#12803 from mengxr/SPARK-14412.
## What changes were proposed in this pull request?
Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#12779 from srowen/SPARK-14533.2.
## What changes were proposed in this pull request?
This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing.
## How was this patch tested?
Exiting unit tests.
cc: cloud-fan
Author: Xiangrui Meng <meng@databricks.com>
Closes#12805 from mengxr/SPARK-14850.
## What changes were proposed in this pull request?
This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.
## How was this patch tested?
existing tests and new test suite `UnsafeArraySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12640 from cloud-fan/ml.
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.
## How was this patch tested?
New test cases in `ALSSuite` and `tests.py`.
cc yanboliang jkbradley sethah rishabhbhardwaj
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#12660 from MLnick/SPARK-14412-als-storage-params.
## What changes were proposed in this pull request?
Modified Kmeans to store cluster centers with one per row
## How was this patch tested?
Existing tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12792 from jkbradley/kmeans-save-fix.
## What changes were proposed in this pull request?
Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}
## How was this patch tested?
No tests involved since it's logging related.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#12536 from BenFradet/SPARK-14570.
## What changes were proposed in this pull request?
pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence
This replaces [https://github.com/apache/spark/pull/10242]
## How was this patch tested?
* doc test for LDA, including Param setters
* unit test for persistence
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>
Closes#12723 from jkbradley/zjffdu-SPARK-11940.
## What changes were proposed in this pull request?
Deprecated model field in LinearRegressionSummary
Removed unnecessary Since annotations
## How was this patch tested?
Existing tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12763 from jkbradley/lr-summary-api.
SparkR ```glm``` and ```kmeans``` model persistence.
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>
Closes#12778 from yanboliang/spark-14311.
Closes#12680Closes#12683
## What changes were proposed in this pull request?
Add log instrumentation for parameters:
rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
userCol, itemCol, ratingCol, predictionCol, maxIter,
regParam, nonnegative, checkpointInterval, seed
Add log instrumentation for numUserFeatures and numItemFeatures
## How was this patch tested?
Manual test: Set breakpoint in intellij and run def testALS(). Single step debugging and check the log method is called.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12560 from wangmiao1981/log.
## What changes were proposed in this pull request?
This PR removes duplicate implementation of compute in LogisticGradient class
## How was this patch tested?
unit tests
Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
Closes#12747 from dding3/master.
## What changes were proposed in this pull request?
Handle case where number of predictions is less than label set, k in nDCG computation
## How was this patch tested?
New unit test; existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12756 from srowen/SPARK-14886.
## What changes were proposed in this pull request?
According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12596 from zhengruifeng/deprecate_sgd.
## What changes were proposed in this pull request?
Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
* They first check for metadata.
* If numClasses is not specified in metadata, they identify the largest label value (up to a limit).
This functionality is implemented in a new Classifier.getNumClasses method.
Also
* Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.
## How was this patch tested?
* Unit tests in ClassifierSuite for helper methods
* Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12663 from jkbradley/trees-no-metadata.
## What changes were proposed in this pull request?
This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.
## How was this patch tested?
Scala-style checks passed.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes#12416 from pravingadakh/SPARK-14613.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14916
FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers.
sample:
{a, b}: 5
{x, y, z}: 4
## How was this patch tested?
existing unit tests.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12698 from hhbyyh/freqtos.
## What changes were proposed in this pull request?
This splits GeneralizedLinearRegressionSummary into 2 summary types:
* GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
* GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting
This also add a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset.
The summary no longer provides the model itself as a public val.
Also:
* Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
* Adds hasSummary method.
* Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
* In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
* Pardon the style fixes; that is IntelliJ being obnoxious.
## How was this patch tested?
Existing unit tests + updated test for evaluate and hasSummary
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12624 from jkbradley/model-summary-api.
## What changes were proposed in this pull request?
Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.
For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.
We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.
## How was this patch tested?
`UserDefinedTypeSuite`
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12259 from viirya/improve-sql-usertype.
## What changes were proposed in this pull request?
Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6. This tends to occur when using a mix of transformers from ml.feature. It is because Java Arrays are non-covariant and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage]. This PR modifies the following to accept subclasses of PipelineStage:
* Pipeline.setStages()
* Params.w()
## How was this patch tested?
Unit test which fails to compile before this fix.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12430 from jkbradley/pipeline-setstages.
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.
## How was this patch tested?
Unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12702 from yanboliang/spark-14899.
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
* `RowMatrix` <sup>**[1]**</sup>
1. `computeGramianMatrix`
2. `computeCovariance`
3. `computeColumnSummaryStatistics`
4. `columnSimilarities`
5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
1. `computeGramianMatrix`
* `CoordinateMatrix`
1. `transpose`
* `BlockMatrix`
1. `validate`
2. `cache`
3. `persist`
4. `transpose`
**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
## What changes were proposed in this pull request?
Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API. This was added after 1.6, so we can modify this API without breaking APIs.
This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov
This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method
Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12593 from jkbradley/sparkml-gmm-fix.
## What changes were proposed in this pull request?
There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms.
This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final. This matches most other spark.ml algorithms.
Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12711 from jkbradley/final-trees.
## What changes were proposed in this pull request?
This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write.
```
- val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
- // TODO: repartition with 1 partition after SPARK-5532 gets fixed
- dataRDD.write.parquet(Loader.dataPath(path))
+ sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12676 from dongjoon-hyun/SPARK-14907.
## What changes were proposed in this pull request?
We deprecated ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.
## How was this patch tested?
Existing unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12608 from yanboliang/spark-11559.
## What changes were proposed in this pull request?
That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.
Author: Andrew Or <andrew@databricks.com>
Closes#12686 from andrewor14/visibility.
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.
## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.
Author: Reynold Xin <rxin@databricks.com>
Closes#12688 from rxin/SPARK-14912.
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12685 from yanboliang/spark-14313.
## What changes were proposed in this pull request?
Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label
## How was this patch tested?
Unit tests
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#12500 from BenFradet/SPARK-13962.
## What changes were proposed in this pull request?
In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.
In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.
**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.
## How was this patch tested?
No change in functionality intended.
Author: Andrew Or <andrew@databricks.com>
Closes#12625 from andrewor14/spark-session-refactor.
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```
## How was this patch tested?
Add unit tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12573 from yanboliang/spark-14312.
## What changes were proposed in this pull request?
As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method.
Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.
## How was this patch tested?
unit tests.
cc jkbradley MLnick
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12498 from yanboliang/spark-10574.
## What changes were proposed in this pull request?
Add Python API in ML for GaussianMixture
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add doctest and test cases are the same as mllib Python tests
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12402 from wangmiao1981/gmm.
## What changes were proposed in this pull request?
add the checking for StepSize and Tol in sharedParams
## How was this patch tested?
Unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12530 from zhengruifeng/ml_args_checking.
## What changes were proposed in this pull request?
Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.
- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
## How was this patch tested?
After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12632 from dongjoon-hyun/SPARK-14868.
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12497 from zhengruifeng/del_unused_imports.
## What changes were proposed in this pull request?
We use `RowEncoder` in libsvm data source to serialize the label and features read from libsvm files. However, the schema passed in this encoder is not correct. As the result, we can't correctly select `features` column from the DataFrame. We should use full data schema instead of `requiredSchema` to serialize the data read in. Then do projection to select required columns later.
## How was this patch tested?
`LibSVMRelationSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12611 from viirya/fix-libsvm.
## What changes were proposed in this pull request?
1, fix the indentation
2, add a missing param desc
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12499 from zhengruifeng/fix_doc.
## What changes were proposed in this pull request?
Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.
Author: Joan <joan@goyeau.com>
Closes#12157 from joan38/SPARK-6429-HashCode-Equals.
## What changes were proposed in this pull request?
GLM supports output link prediction.
## How was this patch tested?
unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12287 from yanboliang/spark-14479.
## What changes were proposed in this pull request?
For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another.
This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types.
## How was this patch tested?
Unit tests for all conversions
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12504 from jkbradley/linalg-conversions.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14569
Log instrumentation in KMeans:
- featuresCol
- predictionCol
- k
- initMode
- initSteps
- maxIter
- seed
- tol
- summary
## How was this patch tested?
Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample
Author: Xin Ren <iamshrek@126.com>
Closes#12432 from keypointt/SPARK-14569.
## What changes were proposed in this pull request?
Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does.
This PR documents this fact.
## How was this patch tested?
doc only
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12519 from jkbradley/scaler-variance-doc.
## What changes were proposed in this pull request?
- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12450 from lw-lin/fix-fs-get.
## What changes were proposed in this pull request?
This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.
Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib
## How was this patch tested?
Added test cases in tests.py under both ML and MLlib
Author: Jason Lee <cjlee@us.ibm.com>
Closes#12428 from jasoncl/SPARK-14564.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14306
Add PySpark OneVsRest save/load supports.
## How was this patch tested?
Test with Python unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12439 from yinxusen/SPARK-14306-0415.
## What changes were proposed in this pull request?
This PR removes
- Inappropriate type notations
For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
...
```
to
```scala
words.foreachRDD { (rdd, time) =>
...
```
- Extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
and corrects some obvious style nits.
## How was this patch tested?
This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.
The rules applied were below:
- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```
- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```
**Those rules were not added**
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12413 from HyukjinKwon/SPARK-style.
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.
## How was this patch tested?
Unit tests.
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
Min 1Q Median 3Q Max
-0.95096 -0.16585 -0.00232 0.17410 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.23536 7.1231 4.4561e-11
Sepal_Length 0.34988 0.046301 7.5566 4.1873e-12
Species_versicolor -0.98339 0.072075 -13.644 0
Species_virginica -1.0075 0.093306 -10.798 0
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.22
Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
Min 1Q Median 3Q Max
-0.95096 -0.16522 0.00171 0.18416 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.67650 0.23536 7.123 4.46e-11 ***
Sepal.Length 0.34988 0.04630 7.557 4.19e-12 ***
Speciesversicolor -0.98339 0.07207 -13.644 < 2e-16 ***
Speciesvirginica -1.00751 0.09331 -10.798 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.217
Number of Fisher Scoring iterations: 2
```
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12393 from yanboliang/spark-13925.
## What changes were proposed in this pull request?
Removed duplicated generation of `ids` in OnlineLDAOptimizer.
## How was this patch tested?
tested with existing unit tests.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes#12176 from pravingadakh/SPARK-14370.
## What changes were proposed in this pull request?
This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /* since 1.2.0 */
The BLAS implementation will be copied, and some of the test utilities will be copies as well.
Summary of changes:
1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
- logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
- `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR.
- `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259 to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
- Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
- `Since` was removed.
- `UDT` related code was removed.
- In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
- For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
- All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#12317 from dbtsai/dbtsai-ml-vector.
Hi guys,
I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix
If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.
Author: Fokko Driesprong <f.driesprong@catawiki.nl>
Author: Fokko Driesprong <fokko@driesprongen.nl>
Closes#10839 from Fokko/master.
## What changes were proposed in this pull request?
This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and
parseDouble, for the purpose of robustness and maintainability.
## How was this patch tested?
Existing tests passed.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#12360 from yongtang/SPARK-14565.
## What changes were proposed in this pull request?
In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.
## How was this patch tested?
no tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12377 from jkbradley/regeval-doc.
## What changes were proposed in this pull request?
Move json4s, breeze dependency declaration into parent
## How was this patch tested?
Should be no functional change, but Jenkins tests will test that.
Author: Sean Owen <sowen@cloudera.com>
Closes#12390 from srowen/SPARK-14612.
## What changes were proposed in this pull request?
* Modify ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters.
* Add unit test for spark.ml ```KMeansSummary```.
* Add Since tag.
## How was this patch tested?
unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12254 from yanboliang/spark-14375.
## What changes were proposed in this pull request?
GLM training summaries should provide solver.
## How was this patch tested?
Unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12253 from yanboliang/spark-14461.
```PrefixSpanModel``` supports ```save/load```. It's similar with #9267.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10664 from yanboliang/spark-10386.
## What changes were proposed in this pull request?
* Added save/load for ```GBTClassifier/GBTClassificationModel/GBTRegressor/GBTRegressionModel```.
* Meanwhile, I modified ```EnsembleModelReadWrite.saveImpl/loadImpl``` to support save/load ```treeWeights```.
## How was this patch tested?
Adds standard unit tests for GBT save/load.
cc jkbradley GayathriMurali
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12230 from yanboliang/spark-13783.
## What changes were proposed in this pull request?
This adds extra logging information about a `LogisticRegression` estimator when being fit on a dataset. With this PR, you see the following extra lines when running the example in the documentation:
```
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training: numPartitions=1 storageLevel=StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1)
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): {"regParam":0.3,"elasticNetParam":0.8,"maxIter":10}
...
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numClasses=2
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numFeatures=692
...
16/04/13 07:19:01 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training finished
```
## How was this patch tested?
This PR was manually tested.
Author: Timothy Hunter <timhunter@databricks.com>
Closes#12331 from thunterdb/1604-instrumentation.
## What changes were proposed in this pull request?
It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports.
This PR removes some unused imports in datasources.
## How was this patch tested?
`sbt scalastyle` and some unit tests for them.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12326 from HyukjinKwon/minor-imports.
## What changes were proposed in this pull request?
SparkR does not support type of vector which is the default type of feature column in ML. R predict also does not output intermediate feature column. So SparkR ```predict``` should not output feature column. In this PR, I only fix this issue for ```naiveBayes``` and ```survreg```. ```kmeans``` has the right code route already and ```glm``` will be fixed at SparkRWrapper refactor(#12294).
## How was this patch tested?
No new tests.
cc mengxr shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11958 from yanboliang/spark-14147.
## What changes were proposed in this pull request?
Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are:
* It doesn't work under HiveContext (in Spark 1.6)
* Race conditions
## How was this patch tested?
* Manual test with HiveContext.
* Added a unit test for `transformSchema` to improve coverage.
cc: yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes#12330 from mengxr/SPARK-14563.
## What changes were proposed in this pull request?
AFTSurvivalRegression should support feature standardization, it will improve the convergence rate.
Test the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data which is standard data comes with Survival library in R,
* without standardization(before this PR) -> 74 iterations.
* with standardization(after this PR) -> 38 iterations.
But after this fix, with or without ```standardization``` will converge to the same solution. It means that ```standardization = false``` will run the same code route as ```standardization = true```. Because if the features are not standardized at all, it will result convergency issue when the features have very different scales. This behavior is the same as ML [```LinearRegression``` and ```LogisticRegression```](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247.
cc mengxr
## How was this patch tested?
unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11365 from yanboliang/spark-13322.
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR is focus on glm() and predict(), summary statistics will be done in a separate PR after this get in.
* This PR depends on #12287 which make GLMs support link prediction at Scala side. After that merged, I will add more tests for predict() to this PR.
Unit tests.
cc mengxr jkbradley hhbyyh
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12294 from yanboliang/spark-12566.
## What changes were proposed in this pull request?
This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sort", "log2", "onethird". This PR tries to support any given value to allow model search.
In this PR, `featureSubsetStrategy` could be passed with:
a) a real number in the range of `(0.0-1.0]` that represents the fraction of the number of features in each subset,
b) an integer number (`>0`) that represents the number of features in each subset.
## How was this patch tested?
Two tests `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite` have been updated to check the additional options for params in this PR.
An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#11989 from yongtang/SPARK-3724.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
```
case: Always omit braces in case clauses.
```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.
## How was this patch tested?
Pass the Jenkins tests (including Scala style checking)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12280 from dongjoon-hyun/SPARK-14508.
## What changes were proposed in this pull request?
Now `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12300 from cloud-fan/remove.
## What changes were proposed in this pull request?
Fixes to eliminate warnings during package and doc builds.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12263 from jkbradley/warning-cleanups.
## What changes were proposed in this pull request?
This is follow up for #12089, add unit test for EM LDA which test disable checkpointing when set ```checkpointInterval = -1```.
## How was this patch tested?
unit test.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12286 from yanboliang/spark-14298-followup.
## What changes were proposed in this pull request?
QuantileDiscretizer can return an unexpected number of buckets in certain cases. This PR proposes to fix this issue and also refactor QuantileDiscretizer to use approxQuantiles from DataFrame stats functions.
## How was this patch tested?
QuantileDiscretizerSuite unit tests (some existing tests will change or even be removed in this PR)
Author: Oliver Pierson <ocp@gatech.edu>
Closes#11553 from oliverpierson/SPARK-13600.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies.
The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.
Thanks.
## How was this patch tested?
Unit tests
mengxr tedyu holdenk
Author: DB Tsai <dbt@netflix.com>
Closes#12298 from dbtsai/dbtsai-mllib-local-build-fix.
## What changes were proposed in this pull request?
add the checking for LDA and StreamingKMeans
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12062 from zhengruifeng/initmodel.
## What changes were proposed in this pull request?
This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.
Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.
TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset
## How was this patch tested?
Exiting unit tests with some modifications.
cc: rxin jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#12274 from mengxr/SPARK-14500.
## What changes were proposed in this pull request?
Replace sortBy() with top() to calculate the top N frequent words as dictionary.
## How was this patch tested?
existing unit tests. The terms with same TF would be sorted in descending order. The test would fail if hardcode the terms with same TF the dictionary like "c", "d"...
Author: fwang1 <desperado.wf@gmail.com>
Closes#12265 from lionelfeng/master.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#12241 from dbtsai/dbtsai-mllib-local-build.
## What changes were proposed in this pull request?
CountVectorizerModel has a binary toggle param. This PR is to add binary toggle param for estimator CountVectorizer. As discussed in the JIRA, instead of adding a param into CountVerctorizer, I moved the binary param to CountVectorizerParams. Therefore, the estimator inherits the binary param.
## How was this patch tested?
Add a new test case, which fits the model with binary flag set to true and then check the trained model's all non-zero counts is set to 1.0.
All tests in CounterVectorizerSuite.scala are passed.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#12200 from wangmiao1981/binary_param.
## What changes were proposed in this pull request?
Cleanups to documentation. No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last. Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only
## How was this patch tested?
Doc build
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?
In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpoint by setting ```checkpointInterval = -1```. But we did not handle this situation for LDA actually, we should fix this bug.
## How was this patch tested?
Existing tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12089 from yanboliang/spark-14298.
## What changes were proposed in this pull request?
The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).
This PR adds a "deleteLastCheckpoint" option which defaults to false. This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.
This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.
This also:
* Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
* Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
* Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```
## How was this patch tested?
Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12166 from jkbradley/emlda-save-checkpoint.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.
Most changes are just noise to fix the logging configs.
For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11941 from vanzin/SPARK-14134.
## What changes were proposed in this pull request?
This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ML.
Primary changes:
* Removed `boost` method in mllib GradientBoostedTrees.scala
* Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib
Other changes:
* Added an `updatePrediction` method in GradientBoostedTrees package. This method is added to provide consistency for methods that build predictions from boosted models. There are several methods that hard code the method of predicting as: sum_{i=1}^{numTrees} (treePrediction*treeWeight). Calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training
* Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part of `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ML unit tests. They are left in mllib as well so as to not break the API.
## How was this patch tested?
Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12050 from sethah/SPARK-12382.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
* is not correct.
*/
```
## How was this patch tested?
Pass the Jenkins tests (including `lint-scala`).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12221 from dongjoon-hyun/SPARK-14444.
## What changes were proposed in this pull request?
Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
JIRA: https://issues.apache.org/jira/browse/SPARK-13538
## What changes were proposed in this pull request?
Add GaussianMixture and GaussianMixtureModel to ML package
## How was this patch tested?
unit tests and manual tests were done.
Local Scalastyle checks passed.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11419 from zhengruifeng/mlgmm.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14322
OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line: f12f11e578/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala (L452)
and a few lines below it.
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#12106 from hhbyyh/ldaTreeReduce.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13786
Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
## How was this patch tested?
Test with Python doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12020 from yinxusen/SPARK-13786.
## What changes were proposed in this pull request?
KMeansSummary class : deprecated size and added clusterSizes
Author: Shally Sangal <shallysangal@gmail.com>
Closes#12084 from shallys/master.
## What changes were proposed in this pull request?
In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types.
This PR: return concrete types
The MIMA checks appear to be OK with this change.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12158 from jkbradley/hide-dtm.
## What changes were proposed in this pull request?
**Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)
Modified numTrees method (*deprecation*)
* Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
* What this PR does: Moves method numTrees outside of trait TreeEnsembleModel. Adds it to GBT and RF Models. Deprecates it in RF Models in favor of new method getNumTrees. In Spark 2.1, we can have RF Models include Param numTrees.
Minor items
* Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.
**Implementation details**
* Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
* Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
* These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
* Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
* Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame. Same for DefaultParamsReader.loadMetadata
## How was this patch tested?
Adds standard unit tests for RF save/load
Author: Joseph K. Bradley <joseph@databricks.com>
Author: GayathriMurali <gayathri.m.softie@gmail.com>
Closes#12118 from jkbradley/GayathriMurali-SPARK-13784.
## What changes were proposed in this pull request?
This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
## How was this patch tested?
Manual and pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12139 from dongjoon-hyun/SPARK-14355.
## What changes were proposed in this pull request?
This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12130 from dongjoon-hyun/use_multiine_javadoc_comments.
## What changes were proposed in this pull request?
Typo fixes. No functional changes.
## How was this patch tested?
Built the sources and ran with samples.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#11802 from jaceklaskowski/typo-fixes.
## What changes were proposed in this pull request?
Decision tree helper classes will be migrated to ML. This patch moves those internal classes that are not part of the public API and removes ones that are no longer used, after [SPARK-12183](https://github.com/apache/spark/pull/11855). No functional changes are made.
Details:
* Bin.scala is removed as the ML implementation does not require bins
* mllib NodeIdCache is removed. It was only used by the mllib implementation previously, which no longer exists
* mllib TreePoint is removed. It was only used by the mllib implementation previously, which no longer exists
* BaggedPoint, DTStatsAggregator, DecisionTreeMetadata, BaggedPointSuite and TimeTracker are all moved to ML.
## How was this patch tested?
No functional changes are made. Existing unit tests ensure behavior is unchanged.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12097 from sethah/cleanup_mllib_tree.
Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10355 from BenFradet/SPARK-7425.
## What changes were proposed in this pull request?
Fixes a compilation failure introduced in PR #12088 under Scala 2.10.
## How was this patch tested?
Compilation.
Author: Cheng Lian <lian@databricks.com>
Closes#12107 from liancheng/spark-14295-hotfix.
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It's only the code refactor for the original ```KMeans``` wrapper.
## How was this patch tested?
Existing tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12039 from yanboliang/spark-14059.
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers
These features decreased the memory usage and increased flexibility of
internal API.
Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>
Closes#9229 from avulanov/mlp-refactoring.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
```scala
def prepareRead(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Map[String, String] = options
```
After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.
An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.
One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.
## How was this patch tested?
Tested using existing test suites.
Author: Cheng Lian <lian@databricks.com>
Closes#12088 from liancheng/spark-14295-libsvm-build-reader.
# What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-11892
Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.
# How was this patch tested?
Test with Scala unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9934 from yinxusen/SPARK-11892.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13782
Model export/import for BisectingKMeans in spark.ml and mllib
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11933 from hhbyyh/bisectingsave.
## What changes were proposed in this pull request?
This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier.
```scala
- // TODO: how to check ALSO that all elements are greater than 0?
- ParamValidators.arrayLengthGt(1)
+ (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
```
## How was this patch tested?
Pass the Jenkins tests including the new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11964 from dongjoon-hyun/SPARK-14164.
jira: https://issues.apache.org/jira/browse/SPARK-11507
"In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."
root cause: colPtr.last does NOT always equal to values.length in breeze SCSMatrix, which fails the require in SparseMatrix.
easy step to repro:
```
val m1: BM[Double] = new CSCMatrix[Double] (Array (1.0, 1, 1), 3, 3, Array (0, 1, 2, 3), Array (0, 1, 2) )
val m2: BM[Double] = new CSCMatrix[Double] (Array (1.0, 2, 2, 4), 3, 3, Array (0, 0, 2, 4), Array (1, 2, 1, 2) )
val sum = m1 + m2
Matrices.fromBreeze(sum)
```
Solution: By checking the code in [CSCMatrix](28000a7b90/math/src/main/scala/breeze/linalg/CSCMatrix.scala), CSCMatrix in breeze can have extra zeros in the end of data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead as the actual compact operation is only performed when necessary.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9520 from hhbyyh/matricesFromBreeze.
## What changes were proposed in this pull request?
Fix the wrong param name of LDA ```topicDistributionCol```.
## How was this patch tested?
No tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12065 from yanboliang/lda-topicDistributionCol.
https://issues.apache.org/jira/browse/SPARK-14181
TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11985 from yinxusen/SPARK-14181.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14154
I just read the code for KolmogorovSmirnovTest and find it could be much simplified following the original definition.
Send a PR for discussion
## How was this patch tested?
unit test
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11954 from hhbyyh/ksoptimize.
## What changes were proposed in this pull request?
Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality. This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models.
## How was this patch tested?
Added unit tests for ML and MLlib
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
## What changes were proposed in this pull request?
Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each.
GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).
## How was this patch tested?
Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11961 from sethah/SPARK-11730.
https://issues.apache.org/jira/browse/SPARK-11893
jkbradley In order to share read/write with `TrainValidationSplit`, I move the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tunning package.
To reduce the repeated tests, I move the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and create a fake validator called `MyValidator` to test the shared code.
With `SharedReadWrite`, potential newly added `Validator` can share the read/write common part, and only need to implement their extra params save/load.
Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9971 from yinxusen/SPARK-11893.
## What changes were proposed in this pull request?
Fix incorrect use of binarySearch in SparseMatrix
## How was this patch tested?
Unit test added.
Author: Chenliang Xu <chexu@groupon.com>
Closes#11992 from luckyrandom/SPARK-14187.
## What changes were proposed in this pull request?
Better error message with k-means init can't be enough samples from input (because it is perhaps empty)
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11979 from srowen/SPARK-12494.
## What changes were proposed in this pull request?
Made evaluate method public. Fixed LogisticRegressionModel evaluate to handle case when probabilityCol is not specified.
## How was this patch tested?
There were already unit tests for these methods.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11928 from jkbradley/public-evaluate.
## What changes were proposed in this pull request?
This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) is fixed now. Now, we had better remove the comment without changing persistent code.
```scala
- categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
+ categories: Seq[Double]) {
```
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11966 from dongjoon-hyun/change_categories_type.
## What changes were proposed in this pull request?
Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.
## How was this patch tested?
- manully checked that no codes in Spark call these methods any more
- existing test suits
Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>
Closes#11910 from lw-lin/remove-deprecates.
## What changes were proposed in this pull request?
StringIndexerModel.transform sets the output column metadata to use name inputCol. It should not. Fixing this causes a problem with the metadata produced by RFormula.
Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction.
Note that "prefixes" is no longer accurate since internal strings may be replaced.
## How was this patch tested?
Unit test which failed before this fix.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11965 from jkbradley/StringIndexer-fix.
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.
## How was this patch tested?
Test against output from R package survival's survreg.
cc mengxr felixcheung
Close#11447
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11932 from yanboliang/spark-13010-new.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-11871
Add save/load for MLPC
## How was this patch tested?
Test with Scala unit test
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9854 from yinxusen/SPARK-11871.
## What changes were proposed in this pull request?
Just a typo
## How was this patch tested?
N/A
Author: Juarez Bochi <jbochi@gmail.com>
Closes#11896 from jbochi/patch-1.
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests
Copied developer comments from spark.mllib implementation to spark.ml one.
Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type.
**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit(). Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11855 from jkbradley/remove-mllib-tree-impl.
## What changes were proposed in this pull request?
`GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily.
## How was this patch tested?
Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11903 from sethah/SPARK-13952.
## What changes were proposed in this pull request?
Print more info about failed NaiveBayesSuite tests which have exhibited flakiness.
## How was this patch tested?
Ran locally with incorrect check to cause failure.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11858 from jkbradley/naive-bayes-bug-log.
## What changes were proposed in this pull request?
This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.
I removed the preprocess part that omit NA values because we don't know which columns to process.
## How was this patch tested?
Test against output from R package e1071's naiveBayes.
cc: yanboliang yinxusen
Closes#11486
Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#11890 from mengxr/SPARK-13449.
## What changes were proposed in this pull request?
Spark uses `DeveloperApi` annotation, but sometimes it seems to conflict with visibility. This PR tries to fix those conflict by removing annotations for non-publics. The following is the example.
**JobResult.scala**
```scala
DeveloperApi
sealed trait JobResult
DeveloperApi
case object JobSucceeded extends JobResult
-DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
```
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11797 from dongjoon-hyun/SPARK-13986.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
-167,5 +164,7
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#10231 from sethah/SPARK-12182.
## What changes were proposed in this pull request?
This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013,
containing some comment update and style adjustment.
jkbradley
## How was this patch tested?
unit tests.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11830 from hhbyyh/cvToggle.
## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11784 from yanboliang/fix-summary.
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
Say there are 3 categories A, B, C. We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#9474 from sethah/SPARK-10788.
## What changes were proposed in this pull request?
Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema
## How was this patch tested?
Existing unit tests, modified to check using transformSchema instead of validateParams
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11790 from jkbradley/SPARK-13761-cleanup.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
## What changes were proposed in this pull request?
Deprecate validateParams() method here: 035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)
Move all functionality in overridden methods to transformSchema().
Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11620 from hhbyyh/depreValid.
## What changes were proposed in this pull request?
Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
## How was this patch tested?
Existing tests were successfully run on local machine.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11379 from jodersky/SPARK-11011-udt-types.
## What changes were proposed in this pull request?
Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
## How was this patch tested?
Unit tests on sparse and dense matrix.
cc: dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#11757 from mengxr/SPARK-13927.
### What changes were proposed in this pull request?
Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
* The shared implementation is in treeModels.scala
* I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
Other changes:
* Made CategoricalSplit.numCategories public (to use in persistence)
* Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed.
### How was this patch tested?
Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11581 from jkbradley/dt-io.
## What changes were proposed in this pull request?
Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package.
cc mengxr
## How was this patch tested?
The test suite is ignored, but I have enabled all these cases offline and it works as expected.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11463 from yanboliang/spark-13613.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
## How was this patch tested?
No additional unit test required.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#11730 from chenghao-intel/range.
## What changes were proposed in this pull request?
Follow up to https://github.com/apache/spark/pull/11657
- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11725 from srowen/SPARK-13823.2.
## What changes were proposed in this pull request?
Provide R-like summary statistics for GLMs via iteratively reweighted least squares.
## How was this patch tested?
unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11694 from yanboliang/spark-9837.
Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation.
Performance testing should be done to ensure there were no regressions.
Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)
Author: sethah <seth.hendrickson16@gmail.com>
Closes#10607 from sethah/SPARK-12379.
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.
Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
- It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat`
- Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
- It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
- it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
- Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.
Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API.
A stub for `FileScanRDD` is also added, but most methods remain unimplemented.
Other minor cleanups:
- partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
- The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
- `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
- Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.
Author: Michael Armbrust <michael@databricks.com>
Closes#11646 from marmbrus/fileStrategy.
## What changes were proposed in this pull request?
`LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` does not have `regParam` as their constructor arguments. They just depends on GradientDescent's default reqParam values.
To be consistent with other algorithms, we had better add them. The same default value is used.
## How was this patch tested?
Pass the existing unit test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11527 from dongjoon-hyun/SPARK-13686.
## What changes were proposed in this pull request?
This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11689 from dongjoon-hyun/fix_more_typos.
## What changes were proposed in this pull request?
- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11657 from srowen/SPARK-13823.
## What changes were proposed in this pull request?
This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11678 from liancheng/remove-collect-rows-and-take-rows.
## What changes were proposed in this pull request?
This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.
Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).
There are several noticeable API changes related to those returning arrays:
1. `collect`/`take`
- Old APIs in class `DataFrame`:
```scala
def collect(): Array[Row]
def take(n: Int): Array[Row]
```
- New APIs in class `Dataset[T]`:
```scala
def collect(): Array[T]
def take(n: Int): Array[T]
def collectRows(): Array[Row]
def takeRows(n: Int): Array[Row]
```
Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.
Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
1. `randomSplit`
- Old APIs in class `DataFrame`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
def randomSplit(weights: Array[Double]): Array[DataFrame]
```
- New APIs in class `Dataset[T]`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
def randomSplit(weights: Array[Double]): Array[Dataset[T]]
```
Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one.
1. `groupBy`
Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
Other noticeable changes:
1. Dataset always do eager analysis now
We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`.
## How was this patch tested?
Existing tests do the work.
## TODO
- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Closes#11443 from liancheng/ds-to-df.
## What changes were proposed in this pull request?
Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time.
```
// Correct:
if (true) {
println("Wow!")
}
// Incorrect:
if (true){
println("Wow!")
}
```
IntelliJ also shows new warnings based on this.
## How was this patch tested?
Pass the Jenkins ScalaStyle test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11637 from dongjoon-hyun/SPARK-3854.
Adding support for other numeric types:
* Integer
* Short
* Long
* Float
* Decimal
Author: sethah <seth.hendrickson16@gmail.com>
Closes#9777 from sethah/SPARK-11108.
This patch adds an API entry point for single decision tree feature importances.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#9912 from sethah/SPARK-11861.
## What changes were proposed in this pull request?
```GeneralizedLinearRegression``` supports ```save/load```.
cc mengxr
## How was this patch tested?
unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11465 from yanboliang/spark-13615.
## What changes were proposed in this pull request?
In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.
```
- final ArrayList<Product2<Object, Object>> dataToWrite =
- new ArrayList<Product2<Object, Object>>();
+ final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```
Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.
## How was this patch tested?
Manual.
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11541 from dongjoon-hyun/SPARK-13702.
## What changes were proposed in this pull request?
Although we defined ```checkModelData``` in [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` omits to call ```checkModelData``` to check the equality of model data. So actually we did not run the check of model data equality for all test cases currently, we should fix it.
BTW, fix the bug of LDA read/write test which did not set ```docConcentration```. This bug should have failed test, but it does not complain because we did not run ```checkModelData``` actually.
cc jkbradley mengxr
## How was this patch tested?
No new unit test, should pass the exist ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11513 from yanboliang/ml-check-model-data.
## What changes were proposed in this pull request?
Remove last usage of jblas, in tests
## How was this patch tested?
Jenkins tests -- the same ones that are being modified.
Author: Sean Owen <sowen@cloudera.com>
Closes#11560 from srowen/SPARK-13715.
`HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
sqlContext: SQLContext,
location: FileCatalog,
partitionSchema: StructType,
dataSchema: StructType,
bucketSpec: Option[BucketSpec],
fileFormat: FileFormat,
options: Map[String, String]) extends BaseRelation
```
### FileFormat
The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
def inferSchema(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
def prepareWrite(
sqlContext: SQLContext,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
def buildInternalScan(
sqlContext: SQLContext,
dataSchema: StructType,
requiredColumns: Array[String],
filters: Array[Filter],
bucketSet: Option[BitSet],
inputFiles: Array[FileStatus],
broadcastedConf: Broadcast[SerializableConfiguration],
options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
def paths: Seq[Path]
def partitionSpec(schema: Option[StructType]): PartitionSpec
def allFiles(): Seq[FileStatus]
def getStatus(path: Path): Array[FileStatus]
def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters.
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
*In the future we should do that following:
- Break out file handling into its own Strategy as its sufficiently complex / isolated.
- Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11509 from marmbrus/fileDataSource.
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11203 from yinxusen/SPARK-13036.
## What changes were proposed in this pull request?
It avoids counting the dataframe twice.
Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
Author: Elie A <abouhaydar.elias@gmail.com>
Closes#11491 from eliasah/quantile-discretizer-patch.
## What changes were proposed in this pull request?
This PR fixes typos in comments and testcase name of code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
cc mengxr srowen
## How was this patch tested?
Documents change, no test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11344 from yanboliang/shared-cleanup.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
## How was the this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11136 from yanboliang/spark-12811.
JIRA: https://issues.apache.org/jira/browse/SPARK-13506
## What changes were proposed in this pull request?
just chang R Snippet Comment in AssociationRulesSuite
## How was this patch tested?
unit test passsed
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11387 from zhengruifeng/ars.
## What changes were proposed in this pull request?
* The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
* BTW, if we use a known updater(L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarifying ```numCorrections``` will have no effect if we fall into that route.
* Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly.
cc mengxr dbtsai
## How was this patch tested?
No new tests, it should pass all current tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11424 from yanboliang/spark-13545.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the tree module.
closes#10601
Author: Bryan Cutler <cutlerb@gmail.com>
Author: vijaykiran <mail@vijaykiran.com>
Closes#11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
## What changes were proposed in this pull request?
This is another try of PR #11323.
This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.
## How was the this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11388 from liancheng/remove-df-rdd-ops.
jira: https://issues.apache.org/jira/browse/SPARK-13028
MaxAbsScaler works in a very similar way as MinMaxScaler, but scales in a way that the training data lies within the range [-1, 1] by dividing through the largest maximum value in each feature. The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.
Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
Something similar from sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10939 from hhbyyh/maxabs and squashes the following commits:
fd8bdcd [Yuhao Yang] add tag and some optimization on fit
648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
cb10bb6 [Yuhao Yang] remove minmax
91ef8f3 [Yuhao Yang] ut added
8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
a9215b5 [Yuhao Yang] max abs scaler
## What changes were proposed in this pull request?
ML StringIndexer does not protect itself from column name duplication.
We should still improve a way to validate a schema of `StringIndexer` and `StringIndexerModel`. However, it would be great to fix at another issue.
## How was this patch tested?
unit test
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11370 from yu-iskw/SPARK-12874.
## What changes were proposed in this pull request?
This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
## How was the this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11323 from liancheng/remove-df-rdd-ops.
## What changes were proposed in this pull request?
Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` rather than re-fetching it from the ```ParamMap``` for every OWLQN iteration.
cc srowen
## How was this patch tested?
No extra tests are added. It should pass all existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11367 from yanboliang/spark-13490.
## What changes were proposed in this pull request?
Change line 113 of QuantileDiscretizer.scala to
`val requiredSamples = math.max(numBins * numBins, 10000.0)`
so that `requiredSamples` is a `Double`. This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
## How was the this patch tested?
Manual tests. I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected.
Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>
Closes#11319 from oliverpierson/SPARK-13444.
`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave
Author: Xiangrui Meng <meng@databricks.com>
Closes#11226 from mengxr/SPARK-13355.
## What changes were proposed in this pull request?
In order to provide better and consistent result, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6``` which will be equal to ML ```LogisticRegression```.
cc dbtsai
## How was the this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11299 from yanboliang/spark-13429.
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint) method a new array is being created for intercept value and it is being concatenated
with another array which contains the betas, the resulted Array is being converted into a Dense vector which in its turn is being converted into breeze vector.
This is expensive and not necessarily beautiful.
I've tried to solve above mentioned problem by simple algebraic decompositions - keeping and treating intercept independently.
Please let me know what do you think and if you have any questions.
Thanks,
Narine
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes#11179 from NarineK/survivaloptim.
ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11214 from yanboliang/spark-13334.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules.
Closes#10602Closes#10897
Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes#11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
add support of arbitrary length sentence by using the nature representation of sentences in the input.
add new similarity functions and add normalization option for distances in synonym finding
add new accessor for internal structure(the vocabulary and wordindex) for convenience
need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3?
jira link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes#10152 from ygcao/improvementForSentenceBoundary.
## What changes were proposed in this pull request?
Fix MLlib LogisticRegressionWithLBFGS regularization map as:
```SquaredL2Updater``` -> ```elasticNetParam = 0.0```
```L1Updater``` -> ```elasticNetParam = 1.0```
cc dbtsai
## How was the this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11258 from yanboliang/spark-13379.
This PR fixes some warnings found by `build/sbt mllib/test:compile`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#11227 from mengxr/fix-mllib-warnings-201602.
This documents the implementation of ALS in `spark.ml` with example code in scala, java and python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10411 from BenFradet/SPARK-12247.
This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
A use case for this enhancement is for when a user wants to Binarize many similar feature columns at once using the same threshold value (for example a binary threshold applied to many pixels in an image).
This contribution is my original work and I license the work to the project under the project's open source license.
viirya mengxr
Author: seddonm1 <seddonm1@gmail.com>
Closes#10976 from seddonm1/master.
JIRA: https://issues.apache.org/jira/browse/SPARK-12363
This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#10539 from viirya/fix-poweriter.
jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks!
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11151 from yu-iskw/SPARK-13265.
In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#11149 from sasakitoa/remove_multibyte_in_sparkenv.
JIRA: https://issues.apache.org/jira/browse/SPARK-10524
Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8734 from viirya/dt-soft-centroids.
KMeans:
Make a private non-deprecated version of setRuns API so that we can call it from the PythonAPI without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values
MFDataGenerator:
Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.
I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
Author: Holden Karau <holden@us.ibm.com>
Closes#11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-newton optimizer
also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
this change improves training times for one of my test sets from ~7m30s to ~4m30s
Author: Gary King <gary@idibon.com>
Closes#11027 from idigary/spark-13132-optimize-logistic-regression.
Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently.
Author: Imran Younus <iyounus@us.ibm.com>
Closes#10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
Fixes problem and verifies fix by test suite.
Also - adds optional parameter: nullable (Boolean) to: SchemaUtils.appendColumn
and deduplicates SchemaUtils.appendColumn functions.
Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
Closes#10741 from grzegorz-chilkiewicz/master.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10608 from JoshRosen/SPARK-6363.
Implement ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it as a solver rather than estimator, it only used internal so I keep it ```private[ml]```.
There are two limitations in the current implementation compared with R:
* It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code:
```
glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial)
```
* It does not support ```offset```.
Because I considered that ```RFormula``` did not support ```Tuple``` as label and ```offset``` keyword, so I simplified the implementation. But to add support for these two functions is not very hard, I can do it in follow-up PR if it is necessary. Meanwhile, we can also add R-like statistic summary for IRLS.
The implementation refers R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
Please focus on the main structure and overpass minor issues/docs that I will update later. Any comments and opinions will be appreciated.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10639 from yanboliang/spark-9835.
The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept which resulting poor training accuracy with regularization.
The new implementation in ML framework handles this properly, and we should call the implementation in ML from MLlib since majority of users are still using MLlib api.
Note that both of them are doing feature scalings to improve the convergence, and the only difference is ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.
Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
… Add LibSVMOutputWriter
The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter
* Partition is still not supported
* Multiple input paths is not supported
Author: Jeff Zhang <zjffdu@apache.org>
Closes#9595 from zjffdu/SPARK-11622.
https://issues.apache.org/jira/browse/SPARK-12834
We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes#10772 from yinxusen/SPARK-12834.
```PCAModel``` can output ```explainedVariance``` at Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10830 from yanboliang/spark-12905.
Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10222 from yanboliang/spark-11965.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
When all labels are the same, it's a dangerous ground for LogisticRegression without intercept to converge. GLMNET doesn't support this case, and will just exit. GLM can train, but will have a warning message saying the algorithm doesn't converge.
Author: DB Tsai <dbt@netflix.com>
Closes#10862 from dbtsai/add-tests.
Add Since annotations to ml.param and ml.*
Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>
Closes#8935 from taishi-oss/issue10263.
This fixes the behavior of WeightedLeastSquars.fit() when the standard deviation of the target variable is zero. If the fitIntercept is true, there is no need to train.
Author: Imran Younus <iyounus@us.ibm.com>
Closes#10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10472 from BenFradet/SPARK-9716.
From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans.
Author: Holden Karau <holden@us.ibm.com>
Closes#10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
Change assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, where in fact it was lapack.dppsv method.
Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
Closes#10818 from wjur/wjur/rename_error_message.
Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.
cc mengxr
Author: Eric Liang <ekl@databricks.com>
Closes#10323 from ericl/spark-12346.
I create new pr since original pr long time no update.
Please help to review.
srowen
Author: Tommy YU <tummyyu@163.com>
Closes#10756 from Wenpei/add_since_to_recomm.
jira: https://issues.apache.org/jira/browse/SPARK-12026
The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.
I tested on local and the change can improve the performance and the running time was stable.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10146 from hhbyyh/chiSq.
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.
add some missing assert too.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9484 from hhbyyh/ldaTopicPre.
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset.
Update the priority as it will affect the computation process.
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10627 from hhbyyh/w2voverflow.
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10552 from yanboliang/spark-12603.
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.
Plus a few fixes to imports that cropped in since my recent cleanups.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10612 from vanzin/SPARK-3873-enable.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10684 from sarutak/SPARK-12692-followup-mllib.
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
Author: Sean Owen <sowen@cloudera.com>
Closes#10570 from srowen/SPARK-12618.
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areadUnderROC".
Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.
This PR aims to fix both issues.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10328 from BenFradet/SPARK-12368.
Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.
Author: Imran Younus <iyounus@us.ibm.com>
Author: Imran Younus <imranyounus@gmail.com>
Closes#10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
DecisionTreeRegressor will provide variance of prediction as a Double column.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8866 from yanboliang/spark-9622.
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduced a new API for that, and replaces the invocation of the deprecated callUDF with that.
Author: Reynold Xin <rxin@databricks.com>
Closes#10547 from rxin/SPARK-12599.
A slight adjustment to the checker configuration was needed; there is
a handful of warnings still left, but those are because of a bug in
the checker that I'll fix separately (before enabling errors for the
checker, of course).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10535 from vanzin/SPARK-3873-mllib.
Include the following changes:
1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10440 from zsxwing/findbugs.
ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKey` is collection.Map, not mutable.Map but the result is casted to mutable.Map using `asInstanceOf` so we get `ClassCastException`.
Also, the return type of Map#filterKeys is not Serializable. It's the issue of Scala (https://issues.scala-lang.org/browse/SI-6654).
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10381 from sarutak/SPARK-12424.