## What changes were proposed in this pull request?
Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.
## How was this patch tested?
Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure they are the same. This test was done both in the Python REPL and in the unit tests.
Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.
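A minimal sketch of the save/load round trip described above, in plain Python. `ToyEstimator`, `save_params`, `load_params`, and `round_trip` are hypothetical names used only for illustration; they are not the helpers added by this PR, and the single-file JSON layout is an assumption.

```python
import json
import os
import tempfile


class ToyEstimator:
    """Hypothetical stand-in for an estimator whose params are all JSON-serializable."""

    def __init__(self, max_iter=100, reg_param=0.0):
        self.params = {"maxIter": max_iter, "regParam": reg_param}


def save_params(instance, path):
    # Store the class name and the param map in one JSON metadata file (assumed layout).
    metadata = {"class": type(instance).__name__, "paramMap": instance.params}
    with open(path, "w") as f:
        json.dump(metadata, f)


def load_params(cls, path):
    # Rebuild an instance of cls and restore its params from the metadata file.
    with open(path) as f:
        metadata = json.load(f)
    if metadata["class"] != cls.__name__:
        raise ValueError("metadata was written by " + metadata["class"])
    instance = cls()
    instance.params = metadata["paramMap"]
    return instance


def round_trip(instance):
    # Save to a temporary directory, load back, and return the loaded copy.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metadata.json")
        save_params(instance, path)
        return load_params(type(instance), path)
```

Comparing `round_trip(est).params` with `est.params` mirrors the equality check used in the test above.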
Author: Ajay Saini <ajays725@gmail.com>
Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
## What changes were proposed in this pull request?
The comments on parentStats in RF are wrong.
parentStats is not used only for the first iteration; it is used in all iterations for unordered features.
## How was this patch tested?
Author: Peng Meng <peng.meng@intel.com>
Closes #18832 from mpjlu/fixRFDoc.
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18831 from actuaryzhang/sparkROffset.
## What changes were proposed in this pull request?
GBTs inherit from HasStepSize, and LinearSVC/Binarizer inherit from HasThreshold.
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes #18612 from zhengruifeng/override_HasXXX.
## What changes were proposed in this pull request?
SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.
This is a followup PR for SPARK-20307.
## How was this patch tested?
New Unit tests are added.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
add `setWeightCol` method for OneVsRest.
`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.
## How was this patch tested?
+ [x] add a unit test.
Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
## What changes were proposed in this pull request?
Add an R-like summary table to the GLM summary, which includes feature name (if it exists), parameter estimate, standard error, t-stat and p-value. This allows Scala users to easily gather these commonly used inference results.
srowen yanboliang felixcheung
## How was this patch tested?
New tests. One for testing feature Name, and one for testing the summary Table.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16630 from actuaryzhang/glmTable.
## What changes were proposed in this pull request?
This change pulls the `LogisticAggregator` class out of LogisticRegression.scala and makes it extend `DifferentiableLossAggregator`. It also changes logistic regression to use the generic `RDDLossFunction` instead of having its own.
Other minor changes:
* L2Regularization accepts `Option[Int => Double]` for features standard deviation
* L2Regularization uses `Vector` type instead of Array
* Some tests added to LeastSquaresAggregator
## How was this patch tested?
Unit test suites are added.
Author: sethah <shendrickson@cloudera.com>
Closes #18305 from sethah/SPARK-20988.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-21524
ValidatorParamsSuiteHelpers.testFileMove() generates temp directories in the wrong place and does not delete them.
ValidatorParamsSuiteHelpers.testFileMove() is invoked by TrainValidationSplitSuite and CrossValidatorSuite. Currently it uses `tempDir` from `TempDirectory`, which unfortunately is never initialized since the `beforeAll()` of `ValidatorParamsSuiteHelpers` is never invoked.
On my system, it leaves some temp directories in the assembly folder each time I run TrainValidationSplitSuite and CrossValidatorSuite.
## How was this patch tested?
unit test fix
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #18728 from hhbyyh/tempDirFix.
## What changes were proposed in this pull request?
There are three main reasons for this reorg:
* Some params are placed in ```RFormulaBase```, while others are placed in ```RFormula```, which is disorganized.
* ```RFormulaModel``` should have the params ```handleInvalid```, ```formula``` and ```forceIndexLabel```, so that users can get the invalid-value handling policy, the formula, or whether to force index the label if they only have a ```RFormulaModel```. So we need to move these params to ```RFormulaBase```, which is also inherited by ```RFormulaModel```.
* ```RFormulaModel``` should support setting a different ```handleInvalid``` during cross validation.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18681 from yanboliang/rformula-reorg.
## What changes were proposed in this pull request?
Address scapegoat warnings for:
- BigDecimal double constructor
- Catching NPE
- Finalizer without super
- List.size is O(n)
- Prefer Seq.empty
- Prefer Set.empty
- reverse.map instead of reverseMap
- Type shadowing
- Unnecessary if condition.
- Use .log1p
- Var could be val
In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
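To illustrate why a linter flags `log(1 + x)` in favor of `log1p`, here is a small numeric sketch. It uses Python's `math` module as an analog of the JVM calls; the value of `x` is chosen only for illustration.

```python
import math

x = 1e-18

# 1.0 + x rounds to exactly 1.0 in double precision, so all information about x is lost.
naive = math.log(1.0 + x)

# log1p evaluates log(1 + x) without forming 1 + x, so the result stays close to x.
stable = math.log1p(x)
```

Here `naive` is exactly zero while `stable` stays at roughly `x`, which is the precision loss the warning is about.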
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes #18635 from srowen/Scapegoat1.
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidationSplit persistence to pyspark.
## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.
Author: Ajay Saini <ajays725@gmail.com>
Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid values for both the features and label columns.
#18496 only handles invalid values in the features column. This PR adds handling of invalid values for the label column, along with test cases.
## How was this patch tested?
Add test cases.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18613 from yanboliang/spark-20307.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes #17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
1, HasHandleInvalid supports override
2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
## How was this patch tested?
existing tests
[JIRA](https://issues.apache.org/jira/browse/SPARK-18619)
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.
## What changes were proposed in this pull request?
This PR is similar to #17869.
Once `spark.local.dir` is set, unless it is manually cleared before/after a test, it could return the same directory even if this property is configured.
This PR adds before/after handling for each test in ALSCleanerSuite as well.
## How was this patch tested?
existing test.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes #18537 from heary-cao/ALSCleanerSuite.
## What changes were proposed in this pull request?
For the randomForest classifier, if the test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. This patch adds a new method to set the underlying StringIndexer handleInvalid logic.
This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers.
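The handleInvalid behavior mentioned above can be sketched in plain Python. `fit_label_index` and `transform_labels` are hypothetical helpers, not Spark APIs; the three policies mirror StringIndexer's `error`, `skip`, and `keep` options, and the alphabetical tie-break between equally frequent labels is an assumption.

```python
def fit_label_index(labels):
    # Map each label seen during fitting to an index, most frequent first.
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Ties are broken alphabetically here (an assumption for determinism).
    ordered = sorted(counts, key=lambda l: (-counts[l], l))
    return {label: i for i, label in enumerate(ordered)}


def transform_labels(index, labels, handle_invalid="error"):
    out = []
    for label in labels:
        if label in index:
            out.append(index[label])
        elif handle_invalid == "skip":
            continue  # drop rows whose label was not seen during fitting
        elif handle_invalid == "keep":
            out.append(len(index))  # put all unseen labels in one extra bucket
        else:
            raise ValueError("Unseen label: " + str(label))
    return out
```

With the default `error` policy an unseen label raises, which corresponds to the failure described above; `skip` and `keep` let prediction proceed.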
## How was this patch tested?
Add a new unit test based on the error case in the JIRA.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18496 from wangmiao1981/handle.
## What changes were proposed in this pull request?
Add the column name to the exception that is raised for an unsupported data type.
## How was this patch tested?
+ [x] pass all tests.
Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Closes #18523 from facaiy/ENH/vectorassembler_add_col.
## What changes were proposed in this pull request?
This is related with [SPARK-19918](https://issues.apache.org/jira/browse/SPARK-19918) and [SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).
This PR proposes to use `TextFileFormat` and allow multiple input paths (but with a warning) when determining the number of features in LibSVM data source via an extra scan.
There are three points here:
- The main advantage of this change should be to remove file-listing bottlenecks in driver side.
- Another advantage is ones from using `FileScanRDD`. For example, I guess we can use `spark.sql.files.ignoreCorruptFiles` option when determining the number of features.
- We can unify the schema inference code path in text based data sources. This is also a preparation for [SPARK-21289](https://issues.apache.org/jira/browse/SPARK-21289).
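As a rough illustration of what "determining the number of features via an extra scan" means, the sketch below scans LibSVM-formatted text lines and takes the largest one-based feature index. `num_features` is a hypothetical helper in plain Python, not the code path changed by this PR, and skipping blank and `#` lines is an assumption.

```python
def num_features(lines):
    # LibSVM lines look like "<label> <index>:<value> ...", with one-based indices.
    max_index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comment lines (assumed convention)
        for item in line.split()[1:]:
            index = int(item.split(":", 1)[0])
            max_index = max(max_index, index)
    return max_index
```

Because the indices are one-based, the largest index seen equals the feature-vector size, which is why a full pass over the data is needed when the size is not supplied.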
## How was this patch tested?
Unit tests in `LibSVMRelationSuite`.
Closes #18288
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18556 from HyukjinKwon/libsvm-schema.
## What changes were proposed in this pull request?
SparkContext is shared by all sessions, so we should not update its conf for only one session.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #18536 from cloud-fan/config.
## What changes were proposed in this pull request?
The scal() call and the creation of the newCenter vector are done in the driver, after a collectAsMap operation, while they could be done in the distributed RDD.
This PR moves this code before the collectAsMap for better efficiency.
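The reordering can be pictured with a plain-Python sketch in which dictionaries stand in for the RDD and `dict(...)` stands in for `collectAsMap`. All function names here are hypothetical and the data layout (`cluster id -> (sum vector, count)`) is an assumption.

```python
def scale(sum_vector, count):
    # Divide a summed vector by its point count to get the new center.
    return [v / count for v in sum_vector]


def centers_on_driver(totals):
    # Before: collect the raw sums first, then scale each one on the driver.
    collected = dict(totals)  # stands in for collectAsMap
    return {k: scale(s, n) for k, (s, n) in collected.items()}


def centers_distributed(totals):
    # After: scale inside the (distributed) map, then collect finished centers.
    mapped = ((k, scale(s, n)) for k, (s, n) in totals.items())
    return dict(mapped)  # stands in for collectAsMap
```

Both functions return the same centers; only the place where the scaling work runs differs.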
## How was this patch tested?
This was tested manually by running the KMeansExample and verifying that the new code ran without error and gave same output as before.
Author: dardelet <guillaumegorp@gmail.com>
Author: Guillaume Dardelet <dardelet@users.noreply.github.com>
Closes #18491 from dardelet/move-center-calculation-to-distributed-map-kmean.
## What changes were proposed in this pull request?
Added "les" as a French stop word (plural of "le").
Author: Thomas Decaux <ebuildy@gmail.com>
Closes #18514 from ebuildy/patch-1.
## What changes were proposed in this pull request?
1, make param support non-final with `finalFields` option
2, generate `HasSolver` with `finalFields = false`
3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver`
## How was this patch tested?
existing tests
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #16028 from zhengruifeng/param_non_final.
## What changes were proposed in this pull request?
Fix scala-2.10 build failure of ```GeneralizedLinearRegressionSuite```.
## How was this patch tested?
Build with scala-2.10.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18489 from yanboliang/glr.
## What changes were proposed in this pull request?
Add support for offset in GLM. This is useful for at least two reasons:
1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to assess factors on frequency.
2. Test incremental effects of new variables: we can use predictions from the existing model as an offset and run a much smaller model on only the new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scale analysis.
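For the exposure use case above, a minimal sketch of how an offset enters a GLM prediction. It assumes a Poisson model with a log link; `poisson_mean` and the numbers in the usage note are hypothetical, not part of this patch.

```python
import math


def poisson_mean(features, coefficients, intercept, offset=0.0):
    # The offset is added to the linear predictor with a fixed coefficient of 1;
    # it is not estimated from the data.
    eta = intercept + sum(x * b for x, b in zip(features, coefficients)) + offset
    # Log link (assumed): the mean is the exponential of the linear predictor.
    return math.exp(eta)
```

With `offset=math.log(miles)`, the predicted count equals `miles` times the per-mile rate, so the coefficients describe accident frequency rather than raw counts.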
## How was this patch tested?
New test.
yanboliang srowen felixcheung sethah
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #16699 from actuaryzhang/offset.
PR #15999 included fixes for doc strings in the ML shared param traits (occurrences of `>` and `>=`).
This PR simply uses the HTML-escaped version of the param doc to embed into the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated javadoc will be compliant for Java 8.
## How was this patch tested?
Existing tests
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #18420 from MLnick/shared-params-javadoc8.
## What changes were proposed in this pull request?
Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug.
I searched online and tested some other cases, and found that when we fit an R glm model (or other models powered by R formula) without an intercept on a dataset that includes string/category features, R does not use one of the categories in the first category feature as the reference category; that is, it does not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.
## How was this patch tested?
Add standard unit tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12414 from yanboliang/spark-14657.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/17715 added Constrained Logistic Regression for ML. We should add it to SparkR.
## How was this patch tested?
Add new unit tests.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18128 from wangmiao1981/test.
## What changes were proposed in this pull request?
Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18140 from actuaryzhang/sparkRFormula.
## What changes were proposed in this pull request?
LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs.
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to have an overflow in calculating the number of partitions for ML persistence.
This modifies the calculations to use Long.
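A sketch of the kind of overflow involved, emulating JVM 32-bit `Int` arithmetic in Python. The function names and the sizes in the usage note are hypothetical and are not the actual formula in `calculateNumberOfPartitions()`.

```python
def to_int32(value):
    # Wrap an integer into the signed 32-bit range, as JVM Int arithmetic does.
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def size_in_bytes_int(num_words, bytes_per_word):
    # Multiplication carried out in 32 bits can silently wrap around.
    return to_int32(num_words * bytes_per_word)


def size_in_bytes_long(num_words, bytes_per_word):
    # With a wider integer type the product is exact for these magnitudes.
    return num_words * bytes_per_word
```

For example, 10,000,000 words at 400 bytes each is 4,000,000,000 bytes, which exceeds the 32-bit maximum of 2,147,483,647 and wraps to a negative number in the `Int` version.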
## How was this patch tested?
New unit test. I verified that the test fails before this patch.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #18265 from jkbradley/word2vec-save-fix.
## What changes were proposed in this pull request?
JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
The larger changes in this patch are:
* Adds a `DifferentiableLossAggregator` trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
* Adds a `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in an optional regularization loss function for applying the differentiable part of regularization.
* Factors out the regularization from the data part of the cost function, and treats regularization as a separate independent cost function which can be evaluated and added to the data cost function.
* Changes `LinearRegression` to use this new hierarchy as a proof of concept.
* Adds the following new namespaces `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator`
Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.
**NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.**
BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.
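The split between aggregator, loss function, and regularization described in the list above can be sketched in plain Python. This is a toy analog under stated assumptions: lists of lists stand in for RDD partitions, and `LeastSquaresAggregator`, `l2_regularization`, and `loss_function` are simplified hypothetical versions, not the Scala classes in this patch.

```python
class LeastSquaresAggregator:
    """Toy aggregator: accumulates squared-error loss and gradient for fixed coefficients."""

    def __init__(self, coefficients):
        self.coefficients = coefficients
        self.weight_sum = 0.0
        self.loss_sum = 0.0
        self.gradient_sum = [0.0] * len(coefficients)

    def add(self, features, label, weight=1.0):
        # Fold one instance into the running loss and gradient sums.
        error = sum(x * b for x, b in zip(features, self.coefficients)) - label
        self.loss_sum += 0.5 * weight * error * error
        for i, x in enumerate(features):
            self.gradient_sum[i] += weight * error * x
        self.weight_sum += weight
        return self

    def merge(self, other):
        # Combine partial results from another partition.
        self.loss_sum += other.loss_sum
        self.weight_sum += other.weight_sum
        self.gradient_sum = [a + b for a, b in zip(self.gradient_sum, other.gradient_sum)]
        return self

    def loss(self):
        return self.loss_sum / self.weight_sum

    def gradient(self):
        return [g / self.weight_sum for g in self.gradient_sum]


def l2_regularization(reg_param):
    # Regularization as its own differentiable function returning (loss, gradient).
    def fn(coefficients):
        loss = 0.5 * reg_param * sum(b * b for b in coefficients)
        return loss, [reg_param * b for b in coefficients]
    return fn


def loss_function(partitions, make_aggregator, regularization=None):
    # Generic over the aggregator: maps it over the data, then adds regularization.
    def calculate(coefficients):
        merged = None
        for partition in partitions:
            agg = make_aggregator(coefficients)
            for features, label in partition:
                agg.add(features, label)
            merged = agg if merged is None else merged.merge(agg)
        loss, grad = merged.loss(), merged.gradient()
        if regularization is not None:
            reg_loss, reg_grad = regularization(coefficients)
            loss += reg_loss
            grad = [g + r for g, r in zip(grad, reg_grad)]
        return loss, grad
    return calculate
```

Swapping in a different aggregator class changes the data loss without touching `loss_function` or the regularization, which is the reuse the hierarchy is aiming for.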
## How was this patch tested?
Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.
* DifferentiableLossAggregatorSuite
* LeastSquaresAggregatorSuite
* RDDLossFunctionSuite
* DifferentiableRegularizationSuite
Below are some performance testing numbers. Run on a 6 node virtual cluster with 44 cores and ~110G RAM, the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now.
**Note:** timings are best of 3 runs.
| | numFeatures | numPoints | maxIter | regParam | elasticNetParam | SPARK-19762 (sec) | master (sec) |
|----|---------------|-------------|-----------|------------|-------------------|---------------------|----------------|
| 0 | 5000 | 1e+06 | 30 | 0 | 0 | 129.594 | 131.153 |
| 1 | 5000 | 1e+06 | 30 | 0.1 | 0 | 135.54 | 136.327 |
| 2 | 5000 | 1e+06 | 30 | 0.01 | 0.5 | 135.148 | 129.771 |
| 3 | 50000 | 100000 | 30 | 0 | 0 | 145.764 | 144.096 |
## Follow ups
If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.
Author: sethah <seth.hendrickson16@gmail.com>
Author: sethah <shendrickson@cloudera.com>
Closes #17094 from sethah/ml_aggregators.
## What changes were proposed in this pull request?
Destroy broadcasted centers after computing cost
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #18152 from zhengruifeng/destroy_kmeans_model.
## What changes were proposed in this pull request?
Remove extraneous logging.
## How was this patch tested?
Unit tests pass.
Author: David Eis <deis@bloomberg.net>
Closes #18188 from davideis/fix-test.
## What changes were proposed in this pull request?
The current conf setting logic is a little complex and has duplication; this PR simplifies it.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #18172 from cloud-fan/session.
## What changes were proposed in this pull request?
- ~~I added the method `toBlockMatrixDense` to the IndexedRowMatrix class. The current implementation of `toBlockMatrix` is insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity.~~
EDIT: Ended up deciding that there should be just a single `toBlockMatrix` method, which creates a BlockMatrix whose blocks may be dense or sparse depending on the sparsity of the rows. This method will work better on any current use case of `toBlockMatrix` and doesn't go through `CoordinateMatrix` like the old method.
## How was this patch tested?
~~I used the same tests already written for `toBlockMatrix()` to test this method. I also added a new additional unit test for an edge case that was not adequately tested by current test suite.~~
I ran the original `IndexedRowMatrix` tests, plus wrote more to better handle edge cases ignored by original tests.
Author: John Compitello <johnc@broadinstitute.org>
Closes #17459 from johnc1231/johnc-fix-ir-to-block.
## What changes were proposed in this pull request?
Revert the handling of negative values in ALS with implicit feedback, so that the confidence is the absolute value of the rating and the preference is 0 for negative ratings. This was the original behavior.
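A small sketch of the restored behavior, in plain Python. `implicit_feedback` is a hypothetical helper; the `alpha` scaling factor and the preference of 1 for positive ratings follow the usual implicit-feedback formulation and are assumptions here, not details stated in this description.

```python
def implicit_feedback(rating, alpha=1.0):
    # Confidence grows with the magnitude of the rating, whatever its sign.
    confidence = alpha * abs(rating)
    # Preference is 0 for negative (and zero) ratings, 1 for positive ones (assumed).
    preference = 1.0 if rating > 0 else 0.0
    return confidence, preference
```

A negative rating therefore still contributes to the model, as a confident observation of zero preference, rather than being ignored.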
## How was this patch tested?
This patch was tested with the existing unit tests and an added unit test to ensure that negative ratings are not ignored.
mengxr
Author: David Eis <deis@bloomberg.net>
Closes #18022 from davideis/bugfix/negative-rating.
## What changes were proposed in this pull request?
When handling strings, the category dropped by RFormula and R are different:
- RFormula drops the least frequent level
- R drops the first level after ascending alphabetical ordering
This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using `stringOrderType = "alphabetDesc"`.
## How was this patch tested?
new tests
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes #17967 from actuaryzhang/RFormula.
## What changes were proposed in this pull request?
Join coefficients with intercept for SparkR linear SVM summary.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18035 from yanboliang/svm-r.
## What changes were proposed in this pull request?
support decision tree in R
## How was this patch tested?
added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17981 from zhengruifeng/dt_r.
## What changes were proposed in this pull request?
When an operation combines two Breeze SparseMatrices, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This makes those arrays inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid data.
In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, but these might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data, using their counter of active data.
This method is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
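The slicing idea can be sketched on plain Python lists holding a matrix in compressed sparse column (CSC) form. `compact_csc` is a hypothetical helper; it assumes the last column pointer equals the count of valid entries, which plays the role of Breeze's active-data counter here.

```python
def compact_csc(col_ptrs, row_indices, data):
    # In CSC form the last column pointer is the number of stored entries,
    # so anything beyond it in row_indices and data is provisional padding.
    active_size = col_ptrs[-1]
    return col_ptrs, row_indices[:active_size], data[:active_size]
```

For a 2x2 matrix with `col_ptrs = [0, 1, 2]`, padded arrays such as `row_indices = [0, 1, 0, 0]` and `data = [1.0, 2.0, 0.0, 0.0]` are trimmed to the two valid entries, after which the three arrays agree with each other.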
## How was this patch tested?
Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
Closes #17940 from ghoto/bug-fix/SPARK-20687.
Small clean ups from #17742 and #17845.
## How was this patch tested?
Existing unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #17919 from MLnick/SPARK-20677-als-perf-followup.
## What changes were proposed in this pull request?
Review new Scala APIs introduced in 2.2.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #17934 from yanboliang/spark-20501.
## What changes were proposed in this pull request?
Before 2.2, MLlib removed APIs that were deprecated in the previous feature/minor release. From Spark 2.2 on, we have decided to remove deprecated APIs only in a major release, so we need to change the corresponding annotations to tell users that they will be removed in 3.0.
Meanwhile, this fixes bugs in the ML documents. The original ML docs can't show deprecated annotations in ```MLWriter``` and ```MLReader``` related classes; we correct that in this PR.
Before:
![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png)
After:
![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png)
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #17946 from yanboliang/spark-20707.
## What changes were proposed in this pull request?
make param `family` in LoR and `optimizer` in LDA case insensitive
## How was this patch tested?
updated tests
yanboliang
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17910 from zhengruifeng/lr_family_lowercase.
## What changes were proposed in this pull request?
StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.
This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:
- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order
The default is still descending order of label frequency, so there should be no impact to existing programs.
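The four orderings listed above can be sketched in plain Python. `index_labels` is a hypothetical helper, not the StringIndexer implementation, and breaking frequency ties alphabetically is an assumption made so the output is deterministic.

```python
from collections import Counter


def index_labels(labels, string_order_type="frequencyDesc"):
    counts = Counter(labels)
    if string_order_type == "frequencyDesc":
        # Most frequent label gets index 0.
        ordered = sorted(counts, key=lambda l: (-counts[l], l))
    elif string_order_type == "frequencyAsc":
        # Least frequent label gets index 0.
        ordered = sorted(counts, key=lambda l: (counts[l], l))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError("Unsupported order type: " + string_order_type)
    return {label: i for i, label in enumerate(ordered)}
```

For labels `["b", "a", "b", "c", "b", "a"]`, `frequencyDesc` assigns `b` to 0 while `alphabetDesc` assigns `c` to 0 and `a` to the last index.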
## How was this patch tested?
new test
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes #17879 from actuaryzhang/stringIndexer.
## What changes were proposed in this pull request?
This PR adds `withName` to `UserDefinedFunction` for printing UDF names in EXPLAIN.
## How was this patch tested?
Added tests in `UDFSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #17712 from maropu/SPARK-20416.