Commit graph

736 commits

Author SHA1 Message Date
Patrick Wendell ac3197e1b9 Preparing development version 1.4.1-SNAPSHOT 2015-05-19 09:35:12 +00:00
Patrick Wendell 777a08166f Preparing Spark release v1.4.0-rc1 2015-05-19 09:35:12 +00:00
Patrick Wendell 586ede6b32 Revert "Preparing Spark release v1.4.0-rc1"
This reverts commit 79fb01a3be.
2015-05-19 02:27:14 -07:00
Patrick Wendell e7309ec729 Revert "Preparing development version 1.4.1-SNAPSHOT"
This reverts commit a1d896b85b.
2015-05-19 02:27:07 -07:00
Patrick Wendell a1d896b85b Preparing development version 1.4.1-SNAPSHOT 2015-05-19 07:13:24 +00:00
Patrick Wendell 79fb01a3be Preparing Spark release v1.4.0-rc1 2015-05-19 07:13:24 +00:00
Patrick Wendell b0c63d2413 Revert "Preparing Spark release v1.4.0-rc1"
This reverts commit 38ccef36c1.
2015-05-19 00:10:39 -07:00
Patrick Wendell 198a186ad3 Revert "Preparing development version 1.4.1-SNAPSHOT"
This reverts commit 40190ce226.
2015-05-19 00:10:37 -07:00
Xusen Yin 38a3fc8485 [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion
JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).

CC jkbradley

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:

1a7d80d [Xusen Yin] merge with master
892a8e9 [Xusen Yin] fix python 3 compatibility
ec935bf [Xusen Yin] small fix
3e9fa1d [Xusen Yin] delete note
69fcf85 [Xusen Yin] simplify and add python example
81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
40babfb [Xusen Yin] add java test suite for PolynomialExpansion

(cherry picked from commit 6008ec14ed)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-19 00:06:43 -07:00
Patrick Wendell 40190ce226 Preparing development version 1.4.1-SNAPSHOT 2015-05-19 06:06:41 +00:00
Patrick Wendell 38ccef36c1 Preparing Spark release v1.4.0-rc1 2015-05-19 06:06:40 +00:00
Patrick Wendell 152b0291c0 Revert "Preparing Spark release v1.4.0-rc1"
This reverts commit e8e97e3a63.
2015-05-18 23:06:15 -07:00
Patrick Wendell 4d098bc049 Revert "Preparing development version 1.4.1-SNAPSHOT"
This reverts commit 758ca74bab.
2015-05-18 23:06:13 -07:00
Patrick Wendell 758ca74bab Preparing development version 1.4.1-SNAPSHOT 2015-05-19 05:01:11 +00:00
Patrick Wendell e8e97e3a63 Preparing Spark release v1.4.0-rc1 2015-05-19 05:01:11 +00:00
Liang-Chi Hsieh dd9f873bea [SPARK-7681] [MLLIB] Add SparseVector support for gemv
JIRA: https://issues.apache.org/jira/browse/SPARK-7681

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:

ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
054f05d [Liang-Chi Hsieh] Fix scala style.
410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.

(cherry picked from commit d03638cc2d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-18 21:32:46 -07:00
Xiangrui Meng cf4e04a0c5 [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master

(cherry picked from commit 9c7e802a5a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-18 12:02:26 -07:00
Shuo Xiang 0b6bc8a239 [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model
The `toString` method of `LogisticRegressionModel` calls `get` method on an Option (threshold) without a safeguard. In spark-shell, the following code `val model = algorithm.run(data).clearThreshold()` in lbfgs code will fail as `toString `method will be called right after `clearThreshold()` to show the results in the REPL.

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6224 from coderxiang/getorelse and squashes the following commits:

d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test

(cherry picked from commit 775e6f9909)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-17 21:16:59 -07:00
Reynold Xin bd057f8b55 [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
Author: Reynold Xin <rxin@databricks.com>

Closes #6211 from rxin/mllib-reader and squashes the following commits:

79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.

(cherry picked from commit 161d0b4a41)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-16 15:04:26 -07:00
AiHe f41be8fb38 [SPARK-7473] [MLLIB] Add reservoir sample in RandomForest
reservoir feature sample by using existing api

Author: AiHe <ai.he@ussuning.com>

Closes #5988 from AiHe/reservoir and squashes the following commits:

e7a41ac [AiHe] remove non-robust testing case
28ffb9a [AiHe] set seed as rng.nextLong
37459e1 [AiHe] set fixed seed
1e98a4c [AiHe] [MLLIB][tree] Add reservoir sample in RandomForest

(cherry picked from commit deb411335a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-15 20:42:59 -07:00
Liang-Chi Hsieh d1f5651004 [SPARK-7668] [MLLIB] Preserve isTransposed property for Matrix after calling map function
JIRA: https://issues.apache.org/jira/browse/SPARK-7668

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6188 from viirya/fix_matrix_map and squashes the following commits:

2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function.

(cherry picked from commit f96b85ab44)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-15 10:03:44 -07:00
Yanbo Liang c0bb974a46 [SPARK-6258] [MLLIB] GaussianMixture Python API parity check
Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python
```scala
GaussianMixture
    setInitialModel
GaussianMixtureModel
    k
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6087 from yanboliang/spark-6258 and squashes the following commits:

b3af21c [Yanbo Liang] fix typo
2b645c1 [Yanbo Liang] fix doc
638b4b7 [Yanbo Liang] address comments
b5bcade [Yanbo Liang] GaussianMixture Python API parity check

(cherry picked from commit 94761485b2)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-15 00:19:20 -07:00
Xiangrui Meng e45cd9f73a [SPARK-7407] [MLLIB] use uid + name to identify parameters
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable

(cherry picked from commit 1b8625f425)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-14 01:22:32 -07:00
DB Tsai 9ab4db29ff [SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition to avoid extra JVM call
Author: DB Tsai <dbt@netflix.com>

Closes #6137 from dbtsai/clean and squashes the following commits:

185816d [DB Tsai] fix compilication issue
f418d08 [DB Tsai] first commit

(cherry picked from commit d3db2fd667)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-13 22:23:31 -07:00
Xiangrui Meng 82f387fe23 [SPARK-7612] [MLLIB] update NB training to use mllib's BLAS
This is similar to the changes to k-means, which gives us better control on the performance. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #6128 from mengxr/SPARK-7612 and squashes the following commits:

b5c24c5 [Xiangrui Meng] merge master
a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS

(cherry picked from commit d5f18de165)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-13 21:27:39 -07:00
leahmcguire d9fb905beb [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1
Author: leahmcguire <lmcguire@salesforce.com>

Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:

b8442c2 [leahmcguire] changed to if else for value checks
911bf83 [leahmcguire] undid reformat
4eedf1e [leahmcguire] moved bernoulli check
9ee9e84 [leahmcguire] fixed style error
3f3b32c [leahmcguire] fixed zero one check so only called in combiner
831fd27 [leahmcguire] got test working
f44bb3c [leahmcguire] removed changes from CV branch
67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
f191c71 [leahmcguire] fixed name
58d060b [leahmcguire] changed param name and test according to comments
04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access

(cherry picked from commit 61e05fc58e)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-13 14:13:29 -07:00
Burak Yavuz 11911b0ae9 [SPARK-7593] [ML] Python Api for ml.feature.Bucketizer
Added `ml.feature.Bucketizer` to PySpark.

cc mengxr

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6124 from brkyvz/ml-bucket and squashes the following commits:

05285be [Burak Yavuz] added sphinx doc
6abb6ed [Burak Yavuz] added support for Bucketizer

(cherry picked from commit 5db18ba6e1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-13 13:21:45 -07:00
Xiangrui Meng 6c292a213b [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly
`RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:

e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly

(cherry picked from commit 2713bc65af)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:54:04 -07:00
Joseph K. Bradley 612247ff05 [SPARK-7573] [ML] OneVsRest cleanups
Minor cleanups discussed with [~mengxr]:
* move OneVsRest from reduction to classification sub-package
* make model constructor private

Some doc cleanups too

CC: harsha2010  Could you please verify this looks OK?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:

4ecd48d [Joseph K. Bradley] org imports
430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.

(cherry picked from commit 96c4846db8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:42:36 -07:00
Joseph K. Bradley d080df10bc [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer
Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.

I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:

d5d213f [Joseph K. Bradley] small fix
dd6e91a [Joseph K. Bradley] fixes from code review of user guide
33c3ff9 [Joseph K. Bradley] small fix
bc6058c [Joseph K. Bradley] fix link
361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide

(cherry picked from commit f0c1bc3472)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:40:03 -07:00
Xiangrui Meng 2555517c97 [SPARK-7571] [MLLIB] rename Math to math
`scala.Math` is deprecated since 2.8. This PR only touchs `Math` usages in MLlib. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:

fe8f8d3 [Xiangrui Meng] Math -> math

(cherry picked from commit a4874b0d18)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 14:39:11 -07:00
Xiangrui Meng 98ccd934f3 [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket.
We make special treatment for +inf in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. It may reads weird if the users need to put 0, 4, 6, 10.1 (or 11).

This also update the impl to use `Arrays.binarySearch` and `withClue` in test.

yinxusen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

e28f910 [Xiangrui Meng] update bucketizer impl

(cherry picked from commit 23b9863e2a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-12 14:24:33 -07:00
Ram Sriharsha fd16709f04 [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All
initial cut of one against all. test code is a scaffolding , not fully implemented.
This WIP is to gather early feedback.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #5830 from harsha2010/reduction and squashes the following commits:

5f4b495 [Ram Sriharsha] Fix Test
386e98b [Ram Sriharsha] Style fix
49b4a17 [Ram Sriharsha] Simplify the test
02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
bc78032 [Ram Sriharsha] Code Review Updates
8ce4845 [Ram Sriharsha] Merge with Master
2a807be [Ram Sriharsha] Merge branch 'master' into reduction
e21bfcc [Ram Sriharsha] Style Fix
5614f23 [Ram Sriharsha] Style Fix
c75583a [Ram Sriharsha] Cleanup
7a5f136 [Ram Sriharsha] Fix TODOs
804826b [Ram Sriharsha] Merge with Master
1448a5f [Ram Sriharsha] Style Fix
6e47807 [Ram Sriharsha] Style Fix
d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
78fa82a [Ram Sriharsha] extra line
0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
a59a4f4 [Ram Sriharsha] @Experimental
4167234 [Ram Sriharsha] Merge branch 'master' into reduction
868a4fd [Ram Sriharsha] @Experimental
041d905 [Ram Sriharsha] Code Review Fixes
df188d8 [Ram Sriharsha] Style fix
612ec48 [Ram Sriharsha] Style Fix
6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
1c7fa44 [Ram Sriharsha] Fix Tests
ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
9738744 [Ram Sriharsha] Merge branch 'master' into reduction
1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
ff272da [Ram Sriharsha] Style fix
28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
b60f874 [Ram Sriharsha] Fix Style issues in Test
3191cdf [Ram Sriharsha] Remove this test, accidental commit
23f056c [Ram Sriharsha] Fix Headers for test
1b5e929 [Ram Sriharsha] Fix Style issues and add Header
8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All

(cherry picked from commit 595a67589a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-12 13:35:33 -07:00
Marcelo Vanzin afe54b76a6 [SPARK-7485] [BUILD] Remove pyspark files from assembly.
The sbt part of the build is hacky; it basically tricks sbt
into generating the zip by using a generator, but returns
an empty list for the generated files so that nothing is
actually added to the assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:

22c1e04 [Marcelo Vanzin] Remove unneeded code.
4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.

(cherry picked from commit 82e890fb19)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:39:28 -07:00
Xusen Yin f188815989 [SPARK-5893] [ML] Add bucketizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer

(cherry picked from commit 35fb42a0b0)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-11 18:41:36 -07:00
Yanbo Liang 017f9fa674 [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6044 from yanboliang/spark-6092 and squashes the following commits:

726a9b1 [Yanbo Liang] add newRankingMetrics
33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib

(cherry picked from commit 042dda3c5c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-11 09:14:26 -07:00
Kirill A. Korinskiy 193ff69d5d [SPARK-5521] PCA wrapper for easy transform vectors
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.

Example of usage:
```
  import org.apache.spark.mllib.regression.LinearRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.feature.PCA

  val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache()
  val test = splits(1)

  val pca = PCA.create(training.first().features.size/2, data.map(_.features))
  val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
  val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

  val numIterations = 100
  val model = LinearRegressionWithSGD.train(training, numIterations)
  val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

  val valuesAndPreds = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
  }

  val valuesAndPreds_pca = test_pca.map { point =>
    val score = model_pca.predict(point.features)
    (score, point.label)
  }

  val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
  val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()

  println("Mean Squared Error = " + MSE)
  println("PCA Mean Squared Error = " + MSE_pca)
```

Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4304 from catap/pca and squashes the following commits:

501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors

(cherry picked from commit 8c07c75c98)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-10 13:34:16 -07:00
Yanbo Liang fe46374f9c [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib
https://issues.apache.org/jira/browse/SPARK-6091

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6011 from yanboliang/spark-6091 and squashes the following commits:

bb3e4ba [Yanbo Liang] trigger jenkins
53c045d [Yanbo Liang] keep compatibility for python 2.6
972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib

(cherry picked from commit bf7e81a51c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-10 00:57:29 -07:00
Joseph K. Bradley 25972d3713 [SPARK-7498] [ML] removed varargs annotation from Params.setDefaults
In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6021 from jkbradley/revert-varargs and squashes the following commits:

098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs

(cherry picked from commit 2992623841)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-08 21:56:02 -07:00
DB Tsai 80bbe72d55 [SPARK-7262] [ML] Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
1) Handle scaling and addBias internally.
2) L1/L2 elasticnet using OWLQN optimizer.

Author: DB Tsai <dbt@netflix.com>

Closes #5967 from dbtsai/lor and squashes the following commits:

fa029bb [DB Tsai] made the bound smaller
0806002 [DB Tsai] better initial intercept and more test
5c31824 [DB Tsai] fix import
c387e25 [DB Tsai] Merge branch 'master' into lor
c84e931 [DB Tsai] Made MultiClassSummarizer private
f98e711 [DB Tsai] address feedback
a784321 [DB Tsai] fix style
8ec65d2 [DB Tsai] remove new line
f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
34705bc [DB Tsai] first commit

(cherry picked from commit 86ef4cfd43)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-08 21:43:13 -07:00
Burak Yavuz 85cab34828 [SPARK-7488] [ML] Feature Parity in PySpark for ml.recommendation
Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6015 from brkyvz/ml-rec and squashes the following commits:

be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS

(cherry picked from commit 84bf931f36)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-08 17:24:39 -07:00
Yanbo Liang ab48df3918 [SPARK-5913] [MLLIB] Python API for ChiSqSelector
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5939 from yanboliang/spark-5913 and squashes the following commits:

cdaac99 [Yanbo Liang] Python API for ChiSqSelector

(cherry picked from commit 35c9599b94)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-08 15:48:50 -07:00
Burak Yavuz 85e11544a7 [SPARK-7383] [ML] Feature Parity in PySpark for ml.features
Implemented python wrappers for Scala functions that don't exist in `ml.features`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:

adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features

(cherry picked from commit f5ff4a84c4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-08 11:14:46 -07:00
Shuo Xiang 28d4238708 [SPARK-7452] [MLLIB] fix bug in topBykey and update test
the toArray function of the BoundedPriorityQueue does not necessarily preserve order. Add a counter-example as the test, which would fail the original impl.

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #5990 from coderxiang/topbykey-test and squashes the following commits:

98804c9 [Shuo Xiang] fix bug in topBykey and update test

(cherry picked from commit 92f8f803a6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-07 20:55:19 -07:00
Xiangrui Meng 475143a56b [SPARK-6948] [MLLIB] compress vectors in VectorAssembler
The compression is based on storage. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #5985 from mengxr/SPARK-6948 and squashes the following commits:

df56a00 [Xiangrui Meng] update python tests
6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler

(cherry picked from commit e43803b8f4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-07 15:45:47 -07:00
Octavian Geagla 76e58b5d88 [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer
See https://issues.apache.org/jira/browse/SPARK-5726

Author: Octavian Geagla <ogeagla@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:

fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
90f7e39 [Joseph K. Bradley] small cleanups
4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer

(cherry picked from commit 658a478d3f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-07 14:50:04 -07:00
Yanbo Liang ef835dc526 [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlib
https://issues.apache.org/jira/browse/SPARK-6093

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5941 from yanboliang/spark-6093 and squashes the following commits:

6934af3 [Yanbo Liang] change to @property
aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib

(cherry picked from commit 1712a7c705)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-07 11:18:38 -07:00
Burak Yavuz 6b9737a830 [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5930 from brkyvz/ml-feat and squashes the following commits:

73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python

(cherry picked from commit 9e2ffb1328)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-07 10:25:49 -07:00
Joseph K. Bradley 91ce13109b [SPARK-7429] [ML] Params cleanups
Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.

CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5960 from jkbradley/params-cleanups and squashes the following commits:

118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel

(cherry picked from commit 4f87e9562a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-07 01:28:59 -07:00
Joseph K. Bradley a038c5174e [SPARK-7421] [MLLIB] OnlineLDA cleanups
Small changes, primarily to allow us more flexibility in the future:
* Rename "tau_0" to "tau0"
* Mark LDAOptimizer trait sealed and DeveloperApi.
* Mark LDAOptimizer subclasses as final.
* Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future

CC: hhbyyh

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits:

f4be508 [Joseph K. Bradley] added newline
f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future

(cherry picked from commit 8b6b46e4ff)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-07 01:12:23 -07:00