This makes equality test failures much more readable.
mengxr
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes#8826 from ericl/attrgroupstr.
```GBTParams``` has ```stepSize``` as learning rate currently.
ML has shared param class ```HasStepSize```, ```GBTParams``` can extend from it rather than duplicated implementation.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8552 from yanboliang/spark-10394.
Should be the same as SPARK-7808 but use Java for the code example.
It would be great to add package doc for `spark.ml.feature`.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.
In fraud detection dataset, almost all the samples are negative while only couple of them are positive. This type of high imbalanced data will bias the models toward negative resulting poor performance. In python-scikit, they provide a correction allowing users to Over-/undersample the samples of each class according to the given weights. In auto mode, selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into loss and gradient instead of doing actual over/undersampling in the training dataset which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data maybe more important like the training samples from tenure users while the training samples from new users maybe less important. We should be able to provide another "weight: Double" information in the LabeledPoint to weight them differently in the learning algorithm.
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
Closes#7884 from dbtsai/SPARK-7685.
This change does two things:
- tag a few tests and adds the mechanism in the build to be able to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8437 from vanzin/test-tags.
jira: https://issues.apache.org/jira/browse/SPARK-10491
We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
Let me know if new UT needed.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#8663 from hhbyyh/movedspr.
Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#8751 from pnpritchard/SPARK-10573.
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8457 from yanboliang/spark-10194.
A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was not present anymore in the toString result. It is the default behaviour.
This patch is a quick fix. The question of enforcement is still up.
No tests have been written to verify the toString method behaviour. That would be long to do because all types should be tested and not only those which have a regression now.
It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira).
Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
Closes#8062 from BertrandDechoux/SPARK-9720.
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8679 from jkbradley/doc-fixes-1.5.
We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR:
1. Do `vectorType == "sparse"` only once.
2. Update `hashCode` and `equals`.
3. Remove inherited doc.
4. Delete temp dir in `afterAll`.
Lewuathe
Author: Xiangrui Meng <meng@databricks.com>
Closes#8699 from mengxr/SPARK-10537.
"checkpointInterval" is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them.
```
member of DecisionTreeParams <-> Scala API
shared param for all ML Transformer/Estimator <-> Python API
```
Proposal:
"checkpointInterval" is also used by ALS, so we make it shared params at Scala.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8528 from yanboliang/spark-10023.
It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.
Two option is implemented.
* `numFeatures`: Specify the dimension of features vector
* `featuresType`: Specify the type of output vector. `sparse` is default.
Author: lewuathe <lewuathe@me.com>
Closes#8537 from Lewuathe/SPARK-10117 and squashes the following commits:
986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
The bulk of the changes are on `transient` annotation on class parameter. Often the compiler doesn't generate a field for this parameters, so the the transient annotation would be unnecessary.
But if the class parameter are used in methods, then fields are created. So it is safer to keep the annotations.
The remainder are some potential bugs, and deprecated syntax.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes#8433 from skyluc/issue/sbt-2.11.
Add WeibullGenerator for RandomDataGenerator.
#8611 need use WeibullGenerator to generate random data based on Weibull distribution.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8622 from yanboliang/spark-10464.
The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.
There are couple TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc
It would be nice to have this merged first because it blocks couple other features.
dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8588 from mengxr/SPARK-9834.
Loader.checkSchema was called to verify the schema after dataframe.select(...).
Schema verification should be done before dataframe.select(...)
Author: Vinod K C <vinod.kc@huawei.com>
Closes#8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent.
Here fix it and add test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8637 from yanboliang/spark-10470.
This PR fix two model ```copy()``` related issues:
[SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480)
```ML.LinearRegressionModel.copy()``` ignored argument ```extra```, it will not take effect when users setting this parameter.
[SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479)
```ML.LogisticRegressionModel.copy()``` should copy model summary if available.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8641 from yanboliang/linear-regression-copy.
From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
We should make sure the scaladoc for params includes their default values through the models in ml/
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
Currently OneVsRest use UDF to generate new binary label during training.
Considering that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise``` which will be more efficiency.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8519 from yanboliang/spark-10349.
This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8182 from mengxr/SPARK-9954.
* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save a object per instance
Further improvements will be addressed in SPARK-10329
cc: yu-iskw HuJiayin
Author: Xiangrui Meng <meng@databricks.com>
Closes#8526 from mengxr/SPARK-10354.
* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes#8436 from feynmanliang/SPARK-10230.
`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.
The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: input dataset gets evaluated twice, in line 270 when fitting `StandardScaler` for the first time, and when running optimizer for the second time. So, it might worth to return removed warning.
Another possible solution is to disable caching entirely & return removed warning. I don't really know what approach is better.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes#8395 from SlavikBaranov/SPARK-10182.
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Replaces `commons.lang.StringUtils` over `com.google.collections.Strings`
* Replaces `List` interface over `ArrayList` implementations
This PR along with #8445#8446#8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
Author: Feynman Liang <fliang@databricks.com>
Closes#8451 from feynmanliang/SPARK-10257.
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases.
Build for 2.10:
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
and 2.11:
./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
Author: Jacek Laskowski <jacek@japila.pl>
Closes#8479 from jaceklaskowski/SPARK-9613-hotfix.
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`
Author: Feynman Liang <fliang@databricks.com>
Closes#8445 from feynmanliang/SPARK-10254.
I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs.
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8452 from mengxr/SPARK-9665.
Same as #8421 but for `mllib.feature`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8449 from mengxr/SPARK-10236.feature and squashes the following commits:
0e8d658 [Xiangrui Meng] remove unnecessary comment
ad70b03 [Xiangrui Meng] update since versions in mllib.feature
Same as #8421 but for `mllib.regression`.
cc freeman-lab dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8426 from mengxr/SPARK-10235 and squashes the following commits:
6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
The same as #8241 but for `mllib.stat` and `mllib.random`.
cc feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8439 from mengxr/SPARK-10242.
Same as #8421 but for `mllib.linalg`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8440 from mengxr/SPARK-10238 and squashes the following commits:
b38437e [Xiangrui Meng] update since versions in mllib.linalg
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperpam optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes#8254 from feynmanliang/SPARK-9888.
Same as #8421 but for `mllib.pmml` and `mllib.util`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8430 from mengxr/SPARK-10239 and squashes the following commits:
a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
Author: Feynman Liang <fliang@databricks.com>
Closes#8424 from feynmanliang/SPARK-9797.
* Adds doc for alias of runMIniBatchSGD documenting default value for convergeTol
* Cleans up a note in code
Author: Feynman Liang <fliang@databricks.com>
Closes#8425 from feynmanliang/SPARK-9800.
Update `Since` annotation in `mllib.classification`:
1. add version to classes, objects, constructors, and public variables declared in constructors
2. correct some versions
3. remove `Since` on `toString`
MechCoder dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#8421 from mengxr/SPARK-10231 and squashes the following commits:
b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes#8033 from srowen/SPARK-9613.
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.
This PR adds a unit test which checks this. It failed previously but works with this fix.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8370 from jkbradley/gmm-fix.
Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer.
Note that Python version does not support selecting by names now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#8267 from yinxusen/SPARK-9893.
Removed categorical feature info validation since no longer needed
This is needed to make the ML user guide examples work (in another current PR).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8367 from jkbradley/gbt-single-cat.
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8329 from jkbradley/lda-topic-assignments.
This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see 72fdeb6463). MechCoder
Closes#8256
Author: Xiangrui Meng <meng@databricks.com>
Author: Xiaoqing Wang <spark445@126.com>
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8288 from mengxr/SPARK-8918.
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds a `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8290 from feynmanliang/SPARK-10097.
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7388 from hhbyyh/cvEstimator.
Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8263 from yanboliang/mlp-public.
This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8260 from mengxr/SPARK-7808.
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes#7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
Also added unit test for integration between StringIndexerModel and IndexToString
CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8211 from jkbradley/stridx-labels.
in MLlib sometimes we need to set metadata for the new column, thus we will alias the new column with metadata before call `withColumn` and in `withColumn` we alias this clolumn again. Here I overloaded `withColumn` to allow user set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#8159 from cloud-fan/withColumn.
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.
As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.
Targeted for 1.5 and master
CC: manishamde mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8187 from jkbradley/tree-1cat.
Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8190 from mengxr/SPARK-9661-fix.
What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better.
~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.
jkbradley holdenk
Author: Xiangrui Meng <meng@databricks.com>
Closes#8152 from mengxr/SPARK-9922.
I skimmed through the docs for various instance of Object and replaced them with Java compaible versions of the same.
1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8126 from MechCoder/java_incop.
To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8164 from yanboliang/mlp-name.
Copied ML models must have the same parent of original ones
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes#7447 from Lewuathe/SPARK-9073.
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
jkbradley yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes#8148 from mengxr/SPARK-9918 and squashes the following commits:
149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
The problem with defining setters in the base class is that it doesn't return the correct type in Java.
ericl
Author: Xiangrui Meng <meng@databricks.com>
Closes#8143 from mengxr/SPARK-9914 and squashes the following commits:
d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8136 from mengxr/SPARK-9903.
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.
CC: mengxr EronWright
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8004 from jkbradley/ml-api-public-items and squashes the following commits:
7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
As per the TODO move weightCol to Shared Params.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set.
CC: mengxr dbtsai feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8079 from jkbradley/logreg-reinstate-threshold.
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.
This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8115 from jkbradley/copyvalues-fix.