Added user guide sections with code examples.
Also added small Java unit tests to test the Java example in the guide.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
cd47f4b [Joseph K. Bradley] Updated based on code review
f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
(cherry picked from commit 2728c3df66)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This PR updates `HashingTF` to output ML attributes that specify the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley
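As a rough illustration of the idea (not the Spark implementation), the sketch below hashes terms into a fixed number of buckets and returns metadata recording that number alongside the vector. The function name `hashing_tf`, the `num_attrs` key, and the use of `zlib.crc32` as the hash are all assumptions made for this example.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1 << 18):
    """Map terms to sparse term frequencies plus metadata for the output column."""
    counts = Counter()
    for term in terms:
        # crc32 is a deterministic stand-in here; it is not the hash Spark uses.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    # Hypothetical metadata layout: only the number of output features is recorded.
    metadata = {"num_attrs": num_features}
    return dict(counts), metadata


vector, metadata = hashing_tf(["a", "b", "a", "c"], num_features=16)
print(sum(vector.values()), metadata["num_attrs"])  # prints: 4 16
```

Here the metadata depends only on `num_features`, which matches the remark that `HashingTF` does not need to look at the input data to describe its output.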
Author: Xiangrui Meng <meng@databricks.com>
Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
9bd2922 [Xiangrui Meng] address comments
e82a68a [Xiangrui Meng] remove sqlContext from test suite
995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
2194703 [Xiangrui Meng] add test for attributes
178ae23 [Xiangrui Meng] update HashingTF with tests
91a6106 [Xiangrui Meng] WIP
(cherry picked from commit 85b96372cf)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
The previous default was `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. That pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
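To make the two settings concrete, here is a minimal plain-Python sketch of the `gaps` semantics, assuming `gaps: true` means the pattern matches separators and `gaps: false` means it matches the tokens themselves. The function `regex_tokenize` is hypothetical, and `\w+|[^\w\s]+` is only an approximation of the old pattern because Python's `re` module does not support `\p{L}`.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Tokenize text; `gaps` selects whether `pattern` matches separators or tokens."""
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # Drop empty strings produced by leading or trailing separators.
    return [t for t in tokens if t]


# New default: split on runs of whitespace.
print(regex_tokenize("Hello, world!  foo"))
# ['Hello,', 'world!', 'foo']

# Approximation of the old default: match word runs or punctuation runs.
print(regex_tokenize("Hello, world!  foo", pattern=r"\w+|[^\w\s]+", gaps=False))
# ['Hello', ',', 'world', '!', 'foo']
```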
Author: Xiangrui Meng <meng@databricks.com>
Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
(cherry picked from commit f5db4b416c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:
74a7259 [Xiangrui Meng] add varargs back to setDefault
(cherry picked from commit cdc7c055c9)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Added VectorIndexer section to ML user guide. Also added javaCategoryMaps() method and Java unit test for it.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide. Also added javaCategoryMaps() method and Java unit test for it.
(cherry picked from commit 6d75ed7e5c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Same issue and fix as in SPARK-7694.
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes #6321 from coderxiang/nb and squashes the following commits:
a5e6de4 [Shuo Xiang] use getOrElse for svmmodel.tostring
2cb0177 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into nb
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
(cherry picked from commit 4f572008f8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Use lowercase model type strings in naive Bayes to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire
Author: Xiangrui Meng <meng@databricks.com>
Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:
f38b662 [Xiangrui Meng] add another case _ back in test
ae5c66a [Xiangrui Meng] model type -> modelType
711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
40ae53e [Xiangrui Meng] fix Java test suite
264a814 [Xiangrui Meng] add case _ back
3c456a8 [Xiangrui Meng] update NB user guide
17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
(cherry picked from commit 13348e21b6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Update the `KernelDensity` API to make it extensible to different kernels in the future. `bandwidth` is used instead of `standardDeviation`. The static `kernelDensity` method is removed from `Statistics`. The implementation now uses BLAS, while the algorithm remains the same. sryza srowen
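For readers unfamiliar with the role of `bandwidth`, the following is a minimal sketch of a kernel density estimate, assuming a Gaussian kernel whose standard deviation is the bandwidth. It is written in plain Python for illustration; the function name and signature are hypothetical and do not mirror the `KernelDensity` API.

```python
import math


def gaussian_kernel_density(samples, bandwidth, points):
    """Estimate the density at each point with a Gaussian kernel of the given bandwidth."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if not samples:
        raise ValueError("samples must be non-empty")
    # Normalization shared by every kernel term: 1 / (n * h * sqrt(2 * pi)).
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth * len(samples))
    estimates = []
    for x in points:
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        estimates.append(norm * total)
    return estimates


print(gaussian_kernel_density([1.0, 2.0, 4.0], bandwidth=1.0, points=[0.0, 2.0]))
```

Swapping in another kernel would only change the expression inside the sum and the normalization constant, which is the kind of extension the new API is meant to allow.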
Author: Xiangrui Meng <meng@databricks.com>
Closes #6279 from mengxr/SPARK-7753 and squashes the following commits:
4cdfadc [Xiangrui Meng] add example code in the doc
767fd5a [Xiangrui Meng] update KernelDensity API
(cherry picked from commit 947ea1cf5f)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Add `sqlContext` to `MLlibTestSparkContext` to simplify test suites that require a SQLContext.
Author: Xiangrui Meng <meng@databricks.com>
Closes #6303 from mengxr/SPARK-7774 and squashes the following commits:
0622b5a [Xiangrui Meng] update some other test suites
e1f9b8d [Xiangrui Meng] add sqlContext to MLlibTestSparkContext
(cherry picked from commit ddec173cba)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley
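A minimal sketch of the idea, assuming the default name is derived from the stage's unique ID; the class name, the UID format, and the `__output` suffix are assumptions for illustration only.

```python
import uuid


class Transformer:
    """Hypothetical transformer with an optional output column name."""

    def __init__(self, output_col=None):
        self.uid = "transformer_" + uuid.uuid4().hex[:12]
        # Fall back to a name derived from the UID so users need not pick one.
        self.output_col = output_col if output_col is not None else self.uid + "__output"


intermediate = Transformer()            # name generated automatically
final = Transformer(output_col="features")  # explicit name still respected
print(intermediate.output_col, final.output_col)
```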
Author: Xiangrui Meng <meng@databricks.com>
Closes #6289 from mengxr/SPARK-7762 and squashes the following commits:
54edebc [Xiangrui Meng] merge master
bff8667 [Xiangrui Meng] update unit test
171246b [Xiangrui Meng] add unit test for outputCol
a4321bd [Xiangrui Meng] set default value for outputCol
(cherry picked from commit c330e52dae)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Minor updates to the spark.mllib APIs:
1. Add `DeveloperApi` to `PMMLExportable` and add `Experimental` to `toPMML` methods.
2. Mention `RankingMetrics.of` in the `RankingMetrics` constructor.
Author: Xiangrui Meng <meng@databricks.com>
Closes #6280 from mengxr/SPARK-7537 and squashes the following commits:
1bd2583 [Xiangrui Meng] organize imports
94afa7a [Xiangrui Meng] mark all toPMML methods experimental
4c40da1 [Xiangrui Meng] mention the factory method for RankingMetrics for Java users
88c62d0 [Xiangrui Meng] add DeveloperApi to PMMLExportable
(cherry picked from commit 2ad4837cfa)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Add MultilabelMetrics in PySpark/MLlib
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6276 from yanboliang/spark-6094 and squashes the following commits:
b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
(cherry picked from commit 98a46f9dff)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
CC jkbradley.
JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).
Author: Xusen Yin <yinxusen@gmail.com>
Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:
77014c5 [Xusen Yin] comment fix
57a4c07 [Xusen Yin] small fix for docs
1178c8f [Xusen Yin] remove the correctness check in java suite
1c3f389 [Xusen Yin] delete sbt commit
1af152b [Xusen Yin] check python example code
1b5369e [Xusen Yin] add docs of word2vec
(cherry picked from commit 68fb2a46ed)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Changed the shared param HasSeed to have a default based on the hashCode of the class name, instead of a random number.
Also removed fixed random seeds from Word2Vec and ALS.
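The sketch below illustrates why a class-name hash gives a default seed that is reproducible across runs yet differs between classes. It reimplements Java's `String.hashCode` in plain Python; the `HasSeed` mixin shown here is a simplified stand-in, not the Spark trait, and whether Spark hashes the simple or the fully qualified class name is not stated above.

```python
def java_string_hash_code(s):
    """Compute Java's String.hashCode for a string of BMP characters."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to a signed int, as Java would report it.
    return h - (1 << 32) if h >= (1 << 31) else h


class HasSeed:
    """Hypothetical mixin: the default seed is derived from the class name."""

    def __init__(self, seed=None):
        default = java_string_hash_code(type(self).__name__)
        self.seed = default if seed is None else seed


class Word2Vec(HasSeed):
    pass


class ALS(HasSeed):
    pass


print(Word2Vec().seed == Word2Vec().seed)  # prints: True
print(Word2Vec().seed == ALS().seed)       # prints: False
```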
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:
0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
(cherry picked from commit 7b16e9f211)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Made Model.parent transient. Added Model.hasParent to test for a null parent.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5914 from jkbradley/parent-optional and squashes the following commits:
d501774 [Joseph K. Bradley] Made Model.parent transient. Added Model.hasParent to test for null parent
(cherry picked from commit fb90273212)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
CC jkbradley
Author: Xusen Yin <yinxusen@gmail.com>
Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
1a7d80d [Xusen Yin] merge with master
892a8e9 [Xusen Yin] fix python 3 compatibility
ec935bf [Xusen Yin] small fix
3e9fa1d [Xusen Yin] delete note
69fcf85 [Xusen Yin] simplify and add python example
81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
40babfb [Xusen Yin] add java test suite for PolynomialExpansion
(cherry picked from commit 6008ec14ed)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7681
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
054f05d [Liang-Chi Hsieh] Fix scala style.
410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
(cherry picked from commit d03638cc2d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.
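The changes listed above can be sketched roughly as follows. This is a simplified, self-contained illustration rather than the PySpark code: the `Estimator` class, its `_fit` stand-in, and the use of plain string keys in `_paramMap` are assumptions for the example.

```python
import copy


class Estimator:
    """Hypothetical estimator whose params live in a private `_paramMap`."""

    def __init__(self, **params):
        self._paramMap = dict(params)

    def copy(self, extra=None):
        # Shallow-copy the instance, then give the copy its own param map.
        that = copy.copy(self)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra or {})
        return that

    def _fit(self, dataset):
        # Stand-in for training: return the params that would be used.
        return dict(self._paramMap)

    def fit(self, dataset, params=None):
        # A list of param maps yields one fitted result per map.
        if isinstance(params, (list, tuple)):
            return [self.fit(dataset, p) for p in params]
        if params:
            return self.copy(params)._fit(dataset)
        return self._fit(dataset)


est = Estimator(maxIter=10)
print(est.fit(None, [{"maxIter": 5}, {"regParam": 0.1}]))
# [{'maxIter': 5}, {'maxIter': 10, 'regParam': 0.1}]
```

Because `fit` works on a copy whenever extra params are supplied, the original estimator is left unchanged, which is what makes fitting with several param maps safe.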
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
(cherry picked from commit 9c7e802a5a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
The `toString` method of `LogisticRegressionModel` calls the `get` method on an Option (threshold) without a safeguard. In spark-shell, the code `val model = algorithm.run(data).clearThreshold()` in the LBFGS example will fail, because `toString` is called right after `clearThreshold()` to show the result in the REPL.
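A minimal Python analogue of the failure mode and its fix, assuming the threshold is modeled as an optional value; the class `LogisticRegressionModelSketch` and its method names are hypothetical and only mirror the shape of the Scala code described above.

```python
from typing import Optional


class LogisticRegressionModelSketch:
    """Hypothetical model whose threshold may be cleared (set to None)."""

    def __init__(self, threshold: Optional[float] = 0.5):
        self.threshold = threshold

    def clear_threshold(self):
        self.threshold = None
        return self

    def __repr__(self):
        # Analogue of getOrElse: fall back to a placeholder instead of failing
        # when the threshold has been cleared.
        shown = self.threshold if self.threshold is not None else "None"
        return f"LogisticRegressionModelSketch(threshold={shown})"


# The REPL calls repr() right after the assignment, so it must not raise.
model = LogisticRegressionModelSketch().clear_threshold()
print(repr(model))  # prints: LogisticRegressionModelSketch(threshold=None)
```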
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes #6224 from coderxiang/getorelse and squashes the following commits:
d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
(cherry picked from commit 775e6f9909)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6211 from rxin/mllib-reader and squashes the following commits:
79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
(cherry picked from commit 161d0b4a41)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Reservoir-sample features by using the existing API.
Author: AiHe <ai.he@ussuning.com>
Closes #5988 from AiHe/reservoir and squashes the following commits:
e7a41ac [AiHe] remove non-robust testing case
28ffb9a [AiHe] set seed as rng.nextLong
37459e1 [AiHe] set fixed seed
1e98a4c [AiHe] [MLLIB][tree] Add reservoir sample in RandomForest
(cherry picked from commit deb411335a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7668
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6188 from viirya/fix_matrix_map and squashes the following commits:
2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function.
(cherry picked from commit f96b85ab44)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Implement the Python API for the major disparities in the GaussianMixture clustering algorithm between Scala and Python:
```scala
GaussianMixture
setInitialModel
GaussianMixtureModel
k
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6087 from yanboliang/spark-6258 and squashes the following commits:
b3af21c [Yanbo Liang] fix typo
2b645c1 [Yanbo Liang] fix doc
638b4b7 [Yanbo Liang] address comments
b5bcade [Yanbo Liang] GaussianMixture Python API parity check
(cherry picked from commit 94761485b2)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
A param instance is strongly attached to a parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the size of the diff and made `parent` a mutable field.
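The identification scheme can be illustrated with a small sketch, assuming a param is just the pair (parent UID, name). The `Param` and `Params` classes below are simplified stand-ins written in Python, not the Scala classes touched by this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """A param identified only by its parent's UID and its own name."""
    parent_uid: str
    name: str


class Params:
    """Hypothetical component holding param values keyed by Param."""

    def __init__(self, uid):
        self.uid = uid
        self._values = {}

    def param(self, name):
        return Param(self.uid, name)

    def set(self, param, value):
        # Ownership is checked by UID, not by object identity of the parent.
        if param.parent_uid != self.uid:
            raise ValueError(f"{param} does not belong to {self.uid}")
        self._values[param] = value
        return self

    def get(self, param):
        return self._values[param]

    def copy(self):
        that = Params(self.uid)  # the UID is preserved by copying
        that._values = dict(self._values)
        return that


lr = Params("logreg_1")
max_iter = lr.param("maxIter")
lr.set(max_iter, 10)
lr_copy = lr.copy().set(max_iter, 20)  # the same Param works on the copy
print(lr.get(max_iter), lr_copy.get(max_iter))  # prints: 10 20
```

Because ownership is decided by comparing UIDs, a param obtained from the original instance remains valid for any copy or fitted model that carries the same UID.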
This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
(cherry picked from commit 1b8625f425)
Signed-off-by: Xiangrui Meng <meng@databricks.com>