Commit graph

814 commits

Author SHA1 Message Date
Xiangrui Meng 89f21f66b5 [SPARK-8049] [MLLIB] drop tmp col from OneVsRest output
The temporary column should be dropped after we get the prediction column. harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6592 from mengxr/SPARK-8049 and squashes the following commits:

1d89107 [Xiangrui Meng] use SparkFunSuite
6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
2015-06-02 16:51:17 -07:00
Mike Dusenberry ad06727fe9 [SPARK-7985] [ML] [MLlib] [Docs] Remove "fittingParamMap" references. Updating ML Doc "Estimator, Transformer, and Param" examples.
Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](https://github.com/apache/spark/pull/5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
2015-06-02 12:38:14 -07:00
Xiangrui Meng 0221c7f0ef [SPARK-7582] [MLLIB] user guide for StringIndexer
This PR adds a Java unit test and user guide for `StringIndexer`. I put it before `OneHotEncoder` because they are closely related. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6561 from mengxr/SPARK-7582 and squashes the following commits:

4bba4f1 [Xiangrui Meng] fix example
ba1cd1b [Xiangrui Meng] fix style
7fa18d1 [Xiangrui Meng] add user guide for StringIndexer
136cb93 [Xiangrui Meng] add a Java unit test for StringIndexer
2015-06-01 22:03:29 -07:00
Xiangrui Meng 90c606925e [SPARK-7584] [MLLIB] User guide for VectorAssembler
This PR adds a section in the user guide for `VectorAssembler` with code examples in Python/Java/Scala. It also adds a unit test in Java.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6556 from mengxr/SPARK-7584 and squashes the following commits:

11313f6 [Xiangrui Meng] simplify Java example
0cd47f3 [Xiangrui Meng] update user guide
fd36292 [Xiangrui Meng] update Java unit test
ce61ca0 [Xiangrui Meng] add Java unit test for VectorAssembler
e399942 [Xiangrui Meng] scala/python example code
2015-06-01 15:05:14 -07:00
Reynold Xin e1067d0ad1 [SPARK-3850] Trim trailing spaces for MLlib.
Author: Reynold Xin <rxin@databricks.com>

Closes #6534 from rxin/whitespace-mllib and squashes the following commits:

38926e3 [Reynold Xin] [SPARK-3850] Trim trailing spaces for MLlib.
2015-05-31 11:35:30 -07:00
Reynold Xin 4b5f12bac9 [SPARK-7979] Enforce structural type checker.
Author: Reynold Xin <rxin@databricks.com>

Closes #6536 from rxin/structural-type-checker and squashes the following commits:

f833151 [Reynold Xin] Fixed compilation.
633f9a1 [Reynold Xin] Fixed typo.
d1fa804 [Reynold Xin] [SPARK-7979] Enforce structural type checker.
2015-05-31 01:37:56 -07:00
Mike Dusenberry 1281a35188 [SPARK-7920] [MLLIB] Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).
The MLlib ChiSqSelector class is not serializable, and so the example in the ChiSqSelector documentation fails. Also, that example is missing the import of ChiSqSelector.

This PR makes ChiSqSelector extend Serializable in MLlib, and adds the ChiSqSelector import statement to the associated example in the documentation.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6462 from dusenberrymw/Make_ChiSqSelector_Serializable_and_Fix_Related_Docs_Example and squashes the following commits:

9cb2f94 [Mike Dusenberry] Make MLlib ChiSqSelector Serializable.
d9003bf [Mike Dusenberry] Add missing import in MLlib ChiSqSelector Docs Scala example.
2015-05-30 16:50:59 -07:00
Andrew Or a4f24123d8 [HOT FIX] [BUILD] Fix maven build failures
This patch fixes a build break in maven caused by #6441.

Note that this patch reverts the changes in flume-sink because
this module does not currently depend on Spark core, but the
tests require it. There is not an easy way to make this work
because mvn test dependencies are not transitive (MNG-1378).

For now, we will leave the one test suite in flume-sink out
until we figure out a better solution. This patch is mainly
intended to unbreak the maven build.

Author: Andrew Or <andrew@databricks.com>

Closes #6511 from andrewor14/fix-build-mvn and squashes the following commits:

3d53643 [Andrew Or] [HOT FIX #6441] Fix maven build failures
2015-05-29 17:19:46 -07:00
Andrew Or 9eb222c139 [SPARK-7558] Demarcate tests in unit-tests.log
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:

```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====

15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents

...

15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:

***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****

...
```

Author: Andrew Or <andrew@databricks.com>

Closes #6441 from andrewor14/demarcate-tests and squashes the following commits:

879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
2015-05-29 14:03:12 -07:00
Reynold Xin 94f62a4979 [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
…

Author: Reynold Xin <rxin@databricks.com>

Closes #6491 from rxin/more-whitespace and squashes the following commits:

f6e63dc [Reynold Xin] [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
2015-05-29 13:38:37 -07:00
MechCoder 6181937f31 [SPARK-7946] [MLLIB] DecayFactor wrongly set in StreamingKMeans
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6497 from MechCoder/spark-7946 and squashes the following commits:

2fdd0a3 [MechCoder] Add non-regression test
8c988c6 [MechCoder] [SPARK-7946] DecayFactor wrongly set in StreamingKMeans
2015-05-29 11:36:41 -07:00
Xiangrui Meng 23452be944 [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
2015-05-29 00:51:12 -07:00
Xiangrui Meng db95137897 [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel
Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`.

coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #6468 from mengxr/SPARK-7922 and squashes the following commits:

7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark
1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
2015-05-28 22:38:38 -07:00
Xiangrui Meng 04616b1a2f [SPARK-7927] [MLLIB] Enforce whitespace for more tokens in style checker
rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #6481 from mengxr/mllib-scalastyle and squashes the following commits:

3ca4d61 [Xiangrui Meng] revert scalastyle config
30961ba [Xiangrui Meng] adjust spaces in mllib/test
571b5c5 [Xiangrui Meng] fix spaces in mllib
2015-05-28 20:09:12 -07:00
Xusen Yin 1bd63e82fd [SPARK-7577] [ML] [DOC] add bucketizer doc
CC jkbradley

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6451 from yinxusen/SPARK-7577 and squashes the following commits:

e2dc32e [Xusen Yin] rename colums
e350e49 [Xusen Yin] add all demos
006ddf1 [Xusen Yin] add java test
3238481 [Xusen Yin] add bucketizer
2015-05-28 17:30:12 -07:00
Xiangrui Meng 7859ab659e [SPARK-7198] [MLLIB] VectorAssembler should output ML attributes
`VectorAssembler` should carry over ML attributes. For unknown attributes, we assume numeric values. This PR handles the following cases:

1. DoubleType with ML attribute: carry over
2. DoubleType without ML attribute: numeric value
3. Scalar type: numeric value
4. VectorType with all ML attributes: carry over and update names
5. VectorType with number of ML attributes: assume all numeric
6. VectorType without ML attributes: check the first row and get the number of attributes

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6452 from mengxr/SPARK-7198 and squashes the following commits:

a9d2469 [Xiangrui Meng] add space
facdb1f [Xiangrui Meng] VectorAssembler should output ML attributes
2015-05-28 16:32:51 -07:00
Xiangrui Meng 530efe3e80 [SPARK-7911] [MLLIB] A workaround for VectorUDT serialize (or deserialize) being called multiple times
~~A PythonUDT shouldn't be serialized into external Scala types in PythonRDD. I'm not sure whether this should fix one of the bugs related to SQL UDT/UDF in PySpark.~~

The fix above didn't work. So I added a workaround for this. If a Python UDF is applied to a Python UDT. This will put the Python SQL types as inputs. Still incorrect, but at least it doesn't throw exceptions on the Scala side. davies harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6442 from mengxr/SPARK-7903 and squashes the following commits:

c257d2a [Xiangrui Meng] add a workaround for VectorUDT
2015-05-28 12:03:46 -07:00
Xiangrui Meng a9f1c0c57b [SPARK-7535] [.1] [MLLIB] minor changes to the pipeline API
1. removed `Params.validateParams(extra)`
2. added `Evaluate.evaluate(dataset, paramPairs*)`
3. updated `RegressionEvaluator` doc

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6392 from mengxr/SPARK-7535.1 and squashes the following commits:

5ff5af8 [Xiangrui Meng] add unit test for CV.validateParams
f1f8369 [Xiangrui Meng] update CV.validateParams() to test estimatorParamMaps
607445d [Xiangrui Meng] merge master
8716f5f [Xiangrui Meng] specify default metric name in RegressionEvaluator
e4e5631 [Xiangrui Meng] update RegressionEvaluator doc
801e864 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7535.1
fcbd3e2 [Xiangrui Meng] Merge branch 'master' into SPARK-7535.1
2192316 [Xiangrui Meng] remove validateParams(extra); add evaluate(dataset, extra*)
2015-05-26 23:51:32 -07:00
Xiangrui Meng 836a75898f [SPARK-7748] [MLLIB] Graduate spark.ml from alpha
With descent coverage of feature transformers, algorithms, and model tuning support, it is time to graduate `spark.ml` from alpha. This PR changes all `AlphaComponent` annotations to either `DeveloperApi` or `Experimental`, depending on whether we expect a class/method to be used by end users (who use the pipeline API to assemble/tune their ML pipelines but not to create new pipeline components.) `UnaryTransformer` becomes a `DeveloperApi` in this PR.

jkbradley harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6417 from mengxr/SPARK-7748 and squashes the following commits:

effbccd [Xiangrui Meng] organize imports
c15028e [Xiangrui Meng] added missing docs
1b2e5f8 [Xiangrui Meng] update package doc
73ca791 [Xiangrui Meng] alpha -> ex/dev for the rest
93819db [Xiangrui Meng] alpha -> ex/dev in ml.param
55ca073 [Xiangrui Meng] alpha -> ex/dev in ml.feature
83572f1 [Xiangrui Meng] add Experimental and DeveloperApi tags (wip)
2015-05-26 15:51:31 -07:00
MechCoder 61664732b2 [SPARK-7844] [MLLIB] Fix broken tests in KernelDensity
The densities in KernelDensity are scaled down by
(number of parallel processes X number of points). It should be just no.of samples. This results in broken tests in KernelDensitySuite which haven't been tested properly.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6383 from MechCoder/spark-7844 and squashes the following commits:

ab81302 [MechCoder] Math->math
9b8ed50 [MechCoder] Make one pass to update count
a92fe50 [MechCoder] [SPARK-7844] Fix broken tests in KernelDensity
2015-05-26 13:21:00 -07:00
Ram Sriharsha 65c696ecc0 [SPARK-7833] [ML] Add python wrapper for RegressionEvaluator
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6365 from harsha2010/SPARK-7833 and squashes the following commits:

923f288 [Ram Sriharsha] cleanup
7623b7d [Ram Sriharsha] python style fix
9743f83 [Ram Sriharsha] [SPARK-7833][ml] Add python wrapper for RegressionEvaluator
2015-05-24 10:36:02 -07:00
Ram Sriharsha f490b3b4c7 [SPARK-7404] [ML] Add RegressionEvaluator to spark.ml
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6344 from harsha2010/SPARK-7404 and squashes the following commits:

16b9d77 [Ram Sriharsha] consistent naming
7f100b6 [Ram Sriharsha] cleanup
c46044d [Ram Sriharsha] Merge with Master + Code Review Fixes
188fa0a [Ram Sriharsha] Merge branch 'master' into SPARK-7404
f5b6a4c [Ram Sriharsha] cleanup doc
97beca5 [Ram Sriharsha] update test to use R packages
32dd310 [Ram Sriharsha] fix indentation
f93b812 [Ram Sriharsha] fix test
1b6ebb3 [Ram Sriharsha] [SPARK-7404][ml] Add RegressionEvaluator to spark.ml
2015-05-22 09:59:44 -07:00
Joseph K. Bradley 2728c3df66 [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler
Added user guide sections with code examples.
Also added small Java unit tests to test Java example in guide.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:

cd47f4b [Joseph K. Bradley] Updated based on code review
f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
2015-05-21 22:59:45 -07:00
Xiangrui Meng 8f11c6116b [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4
Some changes to the pipeilne APIs:

1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does.
1. Move Evaluator to ml.evaluation.
1. Mention larger metric values are better.
1. PipelineModel doc. “compiled” -> “fitted”
1. Hide object PolynomialExpansion.
1. Hide object VectorAssembler.
1. Word2Vec.minCount (and other) -> group param
1. ParamValidators -> DeveloperApi
1. Hide MetadataUtils/SchemaUtils.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:

9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
08ef61f [Xiangrui Meng] update pipieline APIs
2015-05-21 22:57:33 -07:00
Xiangrui Meng 85b96372cf [SPARK-7219] [MLLIB] Output feature attributes in HashingTF
This PR updates `HashingTF` to output ML attributes that tell the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `df outputMetadata: Metadata` is not sufficient because the metadata may also depends on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:

9bd2922 [Xiangrui Meng] address comments
e82a68a [Xiangrui Meng] remove sqlContext from test suite
995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
2194703 [Xiangrui Meng] add test for attributes
178ae23 [Xiangrui Meng] update HashingTF with tests
91a6106 [Xiangrui Meng] WIP
2015-05-21 18:04:45 -07:00
Xiangrui Meng f5db4b416c [SPARK-7794] [MLLIB] update RegexTokenizer default settings
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
2015-05-21 17:59:03 -07:00
Xiangrui Meng cdc7c055c9 [SPARK-7498] [MLLIB] add varargs back to setDefault
We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:

74a7259 [Xiangrui Meng] add varargs back to setDefault
2015-05-21 13:06:53 -07:00
Joseph K. Bradley 6d75ed7e5c [SPARK-7585] [ML] [DOC] VectorIndexer user guide section
Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:

dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
2015-05-21 13:05:48 -07:00
Shuo Xiang 4f572008f8 [SPARK-7793] [MLLIB] Use getOrElse for getting the threshold of SVM model
same issue and fix as in Spark-7694.

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6321 from coderxiang/nb and squashes the following commits:

a5e6de4 [Shuo Xiang] use getOrElse for svmmodel.tostring
2cb0177 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into nb
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
2015-05-21 12:09:44 -07:00
Xiangrui Meng 13348e21b6 [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType
to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire

Author: Xiangrui Meng <meng@databricks.com>

Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:

f38b662 [Xiangrui Meng] add another case _ back in test
ae5c66a [Xiangrui Meng] model type -> modelType
711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
40ae53e [Xiangrui Meng] fix Java test suite
264a814 [Xiangrui Meng] add case _ back
3c456a8 [Xiangrui Meng] update NB user guide
17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
2015-05-21 10:30:08 -07:00
Xiangrui Meng 947ea1cf5f [SPARK-7753] [MLLIB] Update KernelDensity API
Update `KernelDensity` API to make it extensible to different kernels in the future. `bandwidth` is used instead of `standardDeviation`. The static `kernelDensity` method is removed from `Statistics`. The implementation is updated using BLAS, while the algorithm remains the same. sryza srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #6279 from mengxr/SPARK-7753 and squashes the following commits:

4cdfadc [Xiangrui Meng] add example code in the doc
767fd5a [Xiangrui Meng] update KernelDensity API
2015-05-20 23:38:58 -07:00
Xiangrui Meng ddec173cba [SPARK-7774] [MLLIB] add sqlContext to MLlibTestSparkContext
to simplify test suites that require a SQLContext.

Author: Xiangrui Meng <meng@databricks.com>

Closes #6303 from mengxr/SPARK-7774 and squashes the following commits:

0622b5a [Xiangrui Meng] update some other test suites
e1f9b8d [Xiangrui Meng] add sqlContext to MLlibTestSparkContext
2015-05-20 20:30:39 -07:00
Xiangrui Meng c330e52dae [SPARK-7762] [MLLIB] set default value for outputCol
Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6289 from mengxr/SPARK-7762 and squashes the following commits:

54edebc [Xiangrui Meng] merge master
bff8667 [Xiangrui Meng] update unit test
171246b [Xiangrui Meng] add unit test for outputCol
a4321bd [Xiangrui Meng] set default value for outputCol
2015-05-20 17:26:26 -07:00
Xiangrui Meng 2ad4837cfa [SPARK-7537] [MLLIB] spark.mllib API updates
Minor updates to the spark.mllib APIs:

1. Add `DeveloperApi` to `PMMLExportable` and add `Experimental` to `toPMML` methods.
2. Mention `RankingMetrics.of` in the `RankingMetrics` constructor.

Author: Xiangrui Meng <meng@databricks.com>

Closes #6280 from mengxr/SPARK-7537 and squashes the following commits:

1bd2583 [Xiangrui Meng] organize imports
94afa7a [Xiangrui Meng] mark all toPMML methods experimental
4c40da1 [Xiangrui Meng] mention the factory method for RankingMetrics for Java users
88c62d0 [Xiangrui Meng] add DeveloperApi to PMMLExportable
2015-05-20 12:50:06 -07:00
Yanbo Liang 98a46f9dff [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib
Add MultilabelMetrics in PySpark/MLlib

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6276 from yanboliang/spark-6094 and squashes the following commits:

b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
2015-05-20 07:55:51 -07:00
Xiangrui Meng 589b12f8e6 [SPARK-7654] [MLLIB] Migrate MLlib to the DataFrame reader/writer API
parquetFile -> read.parquet rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #6281 from mengxr/SPARK-7654 and squashes the following commits:

a79b612 [Xiangrui Meng] parquetFile -> read.parquet
2015-05-20 07:46:17 -07:00
Xusen Yin b3abf0b8d9 [SPARK-7663] [MLLIB] Add requirement for word2vec model
JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).

We should check the model size of word2vec, to prevent the unexpected empty.

CC srowen.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:

21770c5 [Xusen Yin] check the vocab size
54ae63e [Xusen Yin] add requirement for word2vec model
2015-05-20 10:44:06 +01:00
Liang-Chi Hsieh c12dff9b82 [SPARK-7652] [MLLIB] Update the implementation of naive Bayes prediction with BLAS
JIRA: https://issues.apache.org/jira/browse/SPARK-7652

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6189 from viirya/naive_bayes_blas_prediction and squashes the following commits:

ab611fd [Liang-Chi Hsieh] Remove unnecessary space.
ddc48b9 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into naive_bayes_blas_prediction
b5772b4 [Liang-Chi Hsieh] Fix binary compatibility.
2f65186 [Liang-Chi Hsieh] Remove toDense.
1b6cdfe [Liang-Chi Hsieh] Update the implementation of naive Bayes prediction with BLAS.
2015-05-19 13:53:08 -07:00
Xusen Yin 68fb2a46ed [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package
CC jkbradley.

JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:

77014c5 [Xusen Yin] comment fix
57a4c07 [Xusen Yin] small fix for docs
1178c8f [Xusen Yin] remove the correctness check in java suite
1c3f389 [Xusen Yin] delete sbt commit
1af152b [Xusen Yin] check python example code
1b5369e [Xusen Yin] add docs of word2vec
2015-05-19 13:43:48 -07:00
Joseph K. Bradley 7b16e9f211 [SPARK-7678] [ML] Fix default random seed in HasSeed
Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
Also, removed fixed random seeds from Word2Vec and ALS.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:

0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
2015-05-19 10:57:47 -07:00
Joseph K. Bradley fb90273212 [SPARK-7047] [ML] ml.Model optional parent support
Made Model.parent transient.  Added Model.hasParent to test for null parent

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5914 from jkbradley/parent-optional and squashes the following commits:

d501774 [Joseph K. Bradley] Made Model.parent transient.  Added Model.hasParent to test for null parent
2015-05-19 10:55:21 -07:00
Xusen Yin 6008ec14ed [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion
JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).

CC jkbradley

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:

1a7d80d [Xusen Yin] merge with master
892a8e9 [Xusen Yin] fix python 3 compatibility
ec935bf [Xusen Yin] small fix
3e9fa1d [Xusen Yin] delete note
69fcf85 [Xusen Yin] simplify and add python example
81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
40babfb [Xusen Yin] add java test suite for PolynomialExpansion
2015-05-19 00:06:33 -07:00
Liang-Chi Hsieh d03638cc2d [SPARK-7681] [MLLIB] Add SparseVector support for gemv
JIRA: https://issues.apache.org/jira/browse/SPARK-7681

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:

ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
054f05d [Liang-Chi Hsieh] Fix scala style.
410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
2015-05-18 21:32:36 -07:00
Xiangrui Meng 9c7e802a5a [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
2015-05-18 12:02:18 -07:00
Shuo Xiang 775e6f9909 [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model
The `toString` method of `LogisticRegressionModel` calls `get` method on an Option (threshold) without a safeguard. In spark-shell, the following code `val model = algorithm.run(data).clearThreshold()` in lbfgs code will fail as `toString `method will be called right after `clearThreshold()` to show the results in the REPL.

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6224 from coderxiang/getorelse and squashes the following commits:

d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
2015-05-17 21:16:52 -07:00
Reynold Xin 161d0b4a41 [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
Author: Reynold Xin <rxin@databricks.com>

Closes #6211 from rxin/mllib-reader and squashes the following commits:

79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
2015-05-16 15:03:57 -07:00
AiHe deb411335a [SPARK-7473] [MLLIB] Add reservoir sample in RandomForest
reservoir feature sample by using existing api

Author: AiHe <ai.he@ussuning.com>

Closes #5988 from AiHe/reservoir and squashes the following commits:

e7a41ac [AiHe] remove non-robust testing case
28ffb9a [AiHe] set seed as rng.nextLong
37459e1 [AiHe] set fixed seed
1e98a4c [AiHe] [MLLIB][tree] Add reservoir sample in RandomForest
2015-05-15 20:42:35 -07:00
Liang-Chi Hsieh f96b85ab44 [SPARK-7668] [MLLIB] Preserve isTransposed property for Matrix after calling map function
JIRA: https://issues.apache.org/jira/browse/SPARK-7668

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6188 from viirya/fix_matrix_map and squashes the following commits:

2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function.
2015-05-15 10:03:29 -07:00
Yanbo Liang 94761485b2 [SPARK-6258] [MLLIB] GaussianMixture Python API parity check
Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python
```scala
GaussianMixture
    setInitialModel
GaussianMixtureModel
    k
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6087 from yanboliang/spark-6258 and squashes the following commits:

b3af21c [Yanbo Liang] fix typo
2b645c1 [Yanbo Liang] fix doc
638b4b7 [Yanbo Liang] address comments
b5bcade [Yanbo Liang] GaussianMixture Python API parity check
2015-05-15 00:18:39 -07:00
Xiangrui Meng 1b8625f425 [SPARK-7407] [MLLIB] use uid + name to identify parameters
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
2015-05-14 01:22:15 -07:00
DB Tsai d3db2fd667 [SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition to avoid extra JVM call
Author: DB Tsai <dbt@netflix.com>

Closes #6137 from dbtsai/clean and squashes the following commits:

185816d [DB Tsai] fix compilication issue
f418d08 [DB Tsai] first commit
2015-05-13 22:23:21 -07:00
Xiangrui Meng d5f18de165 [SPARK-7612] [MLLIB] update NB training to use mllib's BLAS
This is similar to the changes to k-means, which gives us better control on the performance. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #6128 from mengxr/SPARK-7612 and squashes the following commits:

b5c24c5 [Xiangrui Meng] merge master
a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS
2015-05-13 21:27:17 -07:00
leahmcguire 61e05fc58e [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1
Author: leahmcguire <lmcguire@salesforce.com>

Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:

b8442c2 [leahmcguire] changed to if else for value checks
911bf83 [leahmcguire] undid reformat
4eedf1e [leahmcguire] moved bernoulli check
9ee9e84 [leahmcguire] fixed style error
3f3b32c [leahmcguire] fixed zero one check so only called in combiner
831fd27 [leahmcguire] got test working
f44bb3c [leahmcguire] removed changes from CV branch
67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
f191c71 [leahmcguire] fixed name
58d060b [leahmcguire] changed param name and test according to comments
04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
2015-05-13 14:13:19 -07:00
Burak Yavuz 5db18ba6e1 [SPARK-7593] [ML] Python Api for ml.feature.Bucketizer
Added `ml.feature.Bucketizer` to PySpark.

cc mengxr

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6124 from brkyvz/ml-bucket and squashes the following commits:

05285be [Burak Yavuz] added sphinx doc
6abb6ed [Burak Yavuz] added support for Bucketizer
2015-05-13 13:21:36 -07:00
Xiangrui Meng 2713bc65af [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly
`RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:

e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly
2015-05-12 16:53:47 -07:00
Joseph K. Bradley 96c4846db8 [SPARK-7573] [ML] OneVsRest cleanups
Minor cleanups discussed with [~mengxr]:
* move OneVsRest from reduction to classification sub-package
* make model constructor private

Some doc cleanups too

CC: harsha2010  Could you please verify this looks OK?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:

4ecd48d [Joseph K. Bradley] org imports
430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.
2015-05-12 16:42:30 -07:00
Joseph K. Bradley f0c1bc3472 [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer
Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.

I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:

d5d213f [Joseph K. Bradley] small fix
dd6e91a [Joseph K. Bradley] fixes from code review of user guide
33c3ff9 [Joseph K. Bradley] small fix
bc6058c [Joseph K. Bradley] fix link
361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
2015-05-12 16:39:56 -07:00
Xiangrui Meng a4874b0d18 [SPARK-7571] [MLLIB] rename Math to math
`scala.Math` is deprecated since 2.8. This PR only touchs `Math` usages in MLlib. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:

fe8f8d3 [Xiangrui Meng] Math -> math
2015-05-12 14:39:03 -07:00
Xiangrui Meng 23b9863e2a [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket.
We make special treatment for +inf in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. It may reads weird if the users need to put 0, 4, 6, 10.1 (or 11).

This also update the impl to use `Arrays.binarySearch` and `withClue` in test.

yinxusen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

e28f910 [Xiangrui Meng] update bucketizer impl
2015-05-12 14:24:26 -07:00
Ram Sriharsha 595a67589a [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All
initial cut of one against all. test code is a scaffolding , not fully implemented.
This WIP is to gather early feedback.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #5830 from harsha2010/reduction and squashes the following commits:

5f4b495 [Ram Sriharsha] Fix Test
386e98b [Ram Sriharsha] Style fix
49b4a17 [Ram Sriharsha] Simplify the test
02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
bc78032 [Ram Sriharsha] Code Review Updates
8ce4845 [Ram Sriharsha] Merge with Master
2a807be [Ram Sriharsha] Merge branch 'master' into reduction
e21bfcc [Ram Sriharsha] Style Fix
5614f23 [Ram Sriharsha] Style Fix
c75583a [Ram Sriharsha] Cleanup
7a5f136 [Ram Sriharsha] Fix TODOs
804826b [Ram Sriharsha] Merge with Master
1448a5f [Ram Sriharsha] Style Fix
6e47807 [Ram Sriharsha] Style Fix
d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
78fa82a [Ram Sriharsha] extra line
0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
a59a4f4 [Ram Sriharsha] @Experimental
4167234 [Ram Sriharsha] Merge branch 'master' into reduction
868a4fd [Ram Sriharsha] @Experimental
041d905 [Ram Sriharsha] Code Review Fixes
df188d8 [Ram Sriharsha] Style fix
612ec48 [Ram Sriharsha] Style Fix
6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
1c7fa44 [Ram Sriharsha] Fix Tests
ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
9738744 [Ram Sriharsha] Merge branch 'master' into reduction
1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
ff272da [Ram Sriharsha] Style fix
28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
b60f874 [Ram Sriharsha] Fix Style issues in Test
3191cdf [Ram Sriharsha] Remove this test, accidental commit
23f056c [Ram Sriharsha] Fix Headers for test
1b5e929 [Ram Sriharsha] Fix Style issues and add Header
8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All
2015-05-12 13:35:12 -07:00
Marcelo Vanzin 82e890fb19 [SPARK-7485] [BUILD] Remove pyspark files from assembly.
The sbt part of the build is hacky; it basically tricks sbt
into generating the zip by using a generator, but returns
an empty list for the generated files so that nothing is
actually added to the assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:

22c1e04 [Marcelo Vanzin] Remove unneeded code.
4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
2015-05-12 01:39:21 -07:00
Xusen Yin 35fb42a0b0 [SPARK-5893] [ML] Add bucketizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer
2015-05-11 18:41:22 -07:00
Yanbo Liang 042dda3c5c [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6044 from yanboliang/spark-6092 and squashes the following commits:

726a9b1 [Yanbo Liang] add newRankingMetrics
33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
2015-05-11 09:14:20 -07:00
Kirill A. Korinskiy 8c07c75c98 [SPARK-5521] PCA wrapper for easy transform vectors
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.

Example of usage:
```
  import org.apache.spark.mllib.regression.LinearRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.feature.PCA

  val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache()
  val test = splits(1)

  val pca = PCA.create(training.first().features.size/2, data.map(_.features))
  val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
  val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

  val numIterations = 100
  val model = LinearRegressionWithSGD.train(training, numIterations)
  val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

  val valuesAndPreds = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
  }

  val valuesAndPreds_pca = test_pca.map { point =>
    val score = model_pca.predict(point.features)
    (score, point.label)
  }

  val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
  val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()

  println("Mean Squared Error = " + MSE)
  println("PCA Mean Squared Error = " + MSE_pca)
```

Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4304 from catap/pca and squashes the following commits:

501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
2015-05-10 13:34:00 -07:00
Yanbo Liang bf7e81a51c [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib
https://issues.apache.org/jira/browse/SPARK-6091

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6011 from yanboliang/spark-6091 and squashes the following commits:

bb3e4ba [Yanbo Liang] trigger jenkins
53c045d [Yanbo Liang] keep compatibility for python 2.6
972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib
2015-05-10 00:57:14 -07:00
Joseph K. Bradley 2992623841 [SPARK-7498] [ML] removed varargs annotation from Params.setDefaults
In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6021 from jkbradley/revert-varargs and squashes the following commits:

098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
2015-05-08 21:55:54 -07:00
DB Tsai 86ef4cfd43 [SPARK-7262] [ML] Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
1) Handle scaling and addBias internally.
2) L1/L2 elasticnet using OWLQN optimizer.

Author: DB Tsai <dbt@netflix.com>

Closes #5967 from dbtsai/lor and squashes the following commits:

fa029bb [DB Tsai] made the bound smaller
0806002 [DB Tsai] better initial intercept and more test
5c31824 [DB Tsai] fix import
c387e25 [DB Tsai] Merge branch 'master' into lor
c84e931 [DB Tsai] Made MultiClassSummarizer private
f98e711 [DB Tsai] address feedback
a784321 [DB Tsai] fix style
8ec65d2 [DB Tsai] remove new line
f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
34705bc [DB Tsai] first commit
2015-05-08 21:43:05 -07:00
Burak Yavuz 84bf931f36 [SPARK-7488] [ML] Feature Parity in PySpark for ml.recommendation
Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6015 from brkyvz/ml-rec and squashes the following commits:

be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS
2015-05-08 17:24:32 -07:00
Yanbo Liang 35c9599b94 [SPARK-5913] [MLLIB] Python API for ChiSqSelector
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5939 from yanboliang/spark-5913 and squashes the following commits:

cdaac99 [Yanbo Liang] Python API for ChiSqSelector
2015-05-08 15:48:39 -07:00
Burak Yavuz f5ff4a84c4 [SPARK-7383] [ML] Feature Parity in PySpark for ml.features
Implemented python wrappers for Scala functions that don't exist in `ml.features`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:

adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
2015-05-08 11:14:39 -07:00
Shuo Xiang 92f8f803a6 [SPARK-7452] [MLLIB] fix bug in topBykey and update test
the toArray function of the BoundedPriorityQueue does not necessarily preserve order. Add a counter-example as the test, which would fail the original impl.

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #5990 from coderxiang/topbykey-test and squashes the following commits:

98804c9 [Shuo Xiang] fix bug in topBykey and update test
2015-05-07 20:55:08 -07:00
Xiangrui Meng e43803b8f4 [SPARK-6948] [MLLIB] compress vectors in VectorAssembler
The compression is based on storage. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #5985 from mengxr/SPARK-6948 and squashes the following commits:

df56a00 [Xiangrui Meng] update python tests
6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler
2015-05-07 15:45:37 -07:00
Octavian Geagla 658a478d3f [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer
See https://issues.apache.org/jira/browse/SPARK-5726

Author: Octavian Geagla <ogeagla@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:

fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
90f7e39 [Joseph K. Bradley] small cleanups
4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
2015-05-07 14:49:55 -07:00
Yanbo Liang 1712a7c705 [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlib
https://issues.apache.org/jira/browse/SPARK-6093

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5941 from yanboliang/spark-6093 and squashes the following commits:

6934af3 [Yanbo Liang] change to @property
aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
2015-05-07 11:18:32 -07:00
Burak Yavuz 9e2ffb1328 [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5930 from brkyvz/ml-feat and squashes the following commits:

73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
2015-05-07 10:25:41 -07:00
Joseph K. Bradley 4f87e9562a [SPARK-7429] [ML] Params cleanups
Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.

CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5960 from jkbradley/params-cleanups and squashes the following commits:

118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
2015-05-07 01:28:44 -07:00
Joseph K. Bradley 8b6b46e4ff [SPARK-7421] [MLLIB] OnlineLDA cleanups
Small changes, primarily to allow us more flexibility in the future:
* Rename "tau_0" to "tau0"
* Mark LDAOptimizer trait sealed and DeveloperApi.
* Mark LDAOptimizer subclasses as final.
* Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future

CC: hhbyyh

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits:

f4be508 [Joseph K. Bradley] added newline
f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
2015-05-07 01:12:14 -07:00
Joseph K. Bradley 1ad04dae03 [SPARK-5995] [ML] Make Prediction dev API public
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: mengxr shivaram etrain

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5913 from jkbradley/public-dev-api and squashes the following commits:

e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.
2015-05-06 16:15:51 -07:00
Xiangrui Meng 32cdc815c6 [SPARK-6940] [MLLIB] Add CrossValidator to Python ML pipeline API
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5926 from mengxr/SPARK-6940 and squashes the following commits:

6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
2015-05-06 01:28:43 -07:00
Yanbo Liang 7b1457839b [SPARK-6267] [MLLIB] Python API for IsotonicRegression
https://issues.apache.org/jira/browse/SPARK-6267

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5890 from yanboliang/spark-6267 and squashes the following commits:

f20541d [Yanbo Liang] Merge pull request #3 from mengxr/SPARK-6267
7f202f9 [Xiangrui Meng] use Vector to have the best Python 2&3 compatibility
4bccfee [Yanbo Liang] fix doctest
ec09412 [Yanbo Liang] fix typos
8214bbb [Yanbo Liang] fix code style
5c8ebe5 [Yanbo Liang] Python API for IsotonicRegression
2015-05-05 22:57:13 -07:00
Sandy Ryza 47728db7cf [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
This patch adds a one hot encoder for categorical features.  Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns.  The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`.  These can be easily gotten from a `StringIndexer`.  The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
2015-05-05 12:34:02 -07:00
Alain d4cb38aeb7 [MLLIB] [TREE] Verify size of input rdd > 0 when building meta data
Require non empty input rdd such that we can take the first labeledpoint and get the feature size

Author: Alain <aihe@usc.edu>
Author: aihe@usc.edu <aihe@usc.edu>

Closes #5810 from AiHe/decisiontree-issue and squashes the following commits:

3b1d08a [aihe@usc.edu] [MLLIB][tree] merge the assertion into the evaluation of numFeatures
cf2e567 [Alain] [MLLIB][tree] Use a rdd api to verify size of input rdd > 0 when building meta data
b448f47 [Alain] [MLLIB][tree] Verify size of input rdd > 0 when building meta data
2015-05-05 16:47:34 +01:00
Hrishikesh Subramonian 5995ada96b [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity
The following items are added to Python kmeans:

kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k

Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>

Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:

b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
2015-05-05 07:57:39 -07:00
MechCoder 5ab652cdb8 [SPARK-7202] [MLLIB] [PYSPARK] Add SparseMatrixPickler to SerDe
Utilities for pickling and unpickling SparseMatrices using SerDe

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5775 from MechCoder/spark-7202 and squashes the following commits:

7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
2015-05-05 07:53:11 -07:00
Xiangrui Meng e0833c5958 [SPARK-5956] [MLLIB] Pipeline components should be copyable.
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.

TODOs:
* [x] add tests for newly added methods
* [ ] update documentation

jkbradley dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:

7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
2015-05-04 11:28:59 -07:00
Yuhao Yang 3539cb7d20 [SPARK-5563] [MLLIB] LDA with online variational inference
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains the implementation for [Online LDA] (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the research of  Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Major advantages for the algorithm are the stream compatibility and economic time/memory consumption due to the corpus split. For more details, please refer to the jira.

Online LDA can act as a fast option for LDA, and will be especially helpful for the users who needs a quick result or with large corpus.

 Correctness test.
I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4419 from hhbyyh/ldaonline and squashes the following commits:

1045eec [Yuhao Yang] Merge pull request #2 from jkbradley/hhbyyh-ldaonline2
cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors.  Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
6149ca6 [Yuhao Yang] fix for setOptimizer
cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
54cf8da [Yuhao Yang] some style change
68c2318 [Yuhao Yang] add a java ut
4041723 [Yuhao Yang] add ut
138bfed [Yuhao Yang] Merge pull request #1 from jkbradley/hhbyyh-ldaonline-update
9e910d9 [Joseph K. Bradley] small fix
61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
a996a82 [Yuhao Yang] respond to comments
b1178cf [Yuhao Yang] fit into the optimizer framework
dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
d19ef55 [Yuhao Yang] change OnlineLDA to class
97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e7bf3b0 [Yuhao Yang] move to seperate file
f367cc9 [Yuhao Yang] change to optimization
8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
02d0373 [Yuhao Yang] fix style in comment
f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
a570c9a [Yuhao Yang] use sample to pick up batch
4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e271eb1 [Yuhao Yang] remove non ascii
581c623 [Yuhao Yang] seperate API and adjust batch split
37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
aa365d1 [Yuhao Yang] merge upstream master
3a06526 [Yuhao Yang] merge with new example
0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
0d0f3ee [Yuhao Yang] replace random split with sliding
fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
f41c5ca [Yuhao Yang] style fix
26dca1b [Yuhao Yang] style fix and make class private
043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: 	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
d640d9c [Yuhao Yang] online lda initial checkin
2015-05-04 00:06:25 -07:00
Reynold Xin 37537760d1 [SPARK-7274] [SQL] Create Column expression for array/struct creation.
Author: Reynold Xin <rxin@databricks.com>

Closes #5802 from rxin/SPARK-7274 and squashes the following commits:

19aecaa [Reynold Xin] Fixed unicode tests.
bfc1538 [Reynold Xin] Export all Python functions.
2517b8c [Reynold Xin] Code review.
23da335 [Reynold Xin] Fixed Python bug.
132002e [Reynold Xin] Fixed tests.
56fce26 [Reynold Xin] Added Python support.
b0d591a [Reynold Xin] Fixed debug error.
86926a6 [Reynold Xin] Added test suite.
7dbb9ab [Reynold Xin] Ok one more.
470e2f5 [Reynold Xin] One more MLlib ...
e2d14f0 [Reynold Xin] [SPARK-7274][SQL] Create Column expression for array/struct creation.
2015-05-01 12:49:02 -07:00
Liang-Chi Hsieh 7630213cab [SPARK-5891] [ML] Add Binarizer ML Transformer
JIRA: https://issues.apache.org/jira/browse/SPARK-5891

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5699 from viirya/add_binarizer and squashes the following commits:

1a0b9a4 [Liang-Chi Hsieh] For comments.
bc397f2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_binarizer
cc4f03c [Liang-Chi Hsieh] Implement threshold param and use merged params map.
7564c63 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_binarizer
1682f8c [Liang-Chi Hsieh] Add Binarizer ML Transformer.
2015-05-01 08:31:01 -07:00
Debasish Das 3b514af8a0 [SPARK-3066] [MLLIB] Support recommendAll in matrix factorization model
This is based on #3098 from debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~

Closes #3098

debasish83 coderxiang

Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5829 from mengxr/SPARK-3066 and squashes the following commits:

22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse
2015-05-01 08:27:46 -07:00
DB Tsai 1c3e402e66 [SPARK-7279] Removed diffSum which is theoretical zero in LinearRegression and coding formating
Author: DB Tsai <dbt@netflix.com>

Closes #5809 from dbtsai/format and squashes the following commits:

6904eed [DB Tsai] triger jenkins
9146e19 [DB Tsai] initial commit
2015-04-30 16:26:51 -07:00
Vincenzo Selvaggio 254e050976 [SPARK-1406] Mllib pmml model export
See PDF attached to the JIRA issue 1406.

The contribution is my original work and I license the work to the project under the project's open source license.

Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
Author: Xiangrui Meng <meng@databricks.com>
Author: selvinsource <vselvaggio@hotmail.it>

Closes #3062 from selvinsource/mllib_pmml_model_export_SPARK-1406 and squashes the following commits:

852aac6 [Vincenzo Selvaggio] [SPARK-1406] Update JPMML version to 1.1.15 in LICENSE file
085cf42 [Vincenzo Selvaggio] [SPARK-1406] Added Double Min and Max Fixed scala style
30165c4 [Vincenzo Selvaggio] [SPARK-1406] Fixed extreme cases for logit
7a5e0ec [Vincenzo Selvaggio] [SPARK-1406] Binary classification for SVM and Logistic Regression
cfcb596 [Vincenzo Selvaggio] [SPARK-1406] Throw IllegalArgumentException when exporting a multinomial logistic regression
25dce33 [Vincenzo Selvaggio] [SPARK-1406] Update code to latest pmml model
dea98ca [Vincenzo Selvaggio] [SPARK-1406] Exclude transitive dependency for pmml model
66b7c12 [Vincenzo Selvaggio] [SPARK-1406] Updated pmml model lib to 1.1.15, latest Java 6 compatible
a0a55f7 [Vincenzo Selvaggio] Merge pull request #2 from mengxr/SPARK-1406
3c22f79 [Xiangrui Meng] more code style
e2313df [Vincenzo Selvaggio] Merge pull request #1 from mengxr/SPARK-1406
472d757 [Xiangrui Meng] fix code style
1676e15 [Vincenzo Selvaggio] fixed scala issue
e2ffae8 [Vincenzo Selvaggio] fixed scala style
b8823b0 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
b25bbf7 [Vincenzo Selvaggio] [SPARK-1406] Added export of pmml to distributed file system using the spark context
7a949d0 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
f46c75c [Vincenzo Selvaggio] [SPARK-1406] Added PMMLExportable to supported models
7b33b4e [Vincenzo Selvaggio] [SPARK-1406] Added a PMMLExportable interface Restructured code in a new package mllib.pmml Supported models implements the new PMMLExportable interface: LogisticRegression, SVM, KMeansModel, LinearRegression, RidgeRegression, Lasso
d559ec5 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
8fe12bb [Vincenzo Selvaggio] [SPARK-1406] Adjusted logistic regression export description and target categories
03bc3a5 [Vincenzo Selvaggio] added logistic regression
da2ec11 [Vincenzo Selvaggio] [SPARK-1406] added linear SVM PMML export
82f2131 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
19adf29 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
1faf985 [Vincenzo Selvaggio] [SPARK-1406] Added target field to the regression model for completeness Adjusted unit test to deal with this change
3ae8ae5 [Vincenzo Selvaggio] [SPARK-1406] Adjusted imported order according to the guidelines
c67ce81 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
78515ec [Vincenzo Selvaggio] [SPARK-1406] added pmml export for LinearRegressionModel, RidgeRegressionModel and LassoModel
e29dfb9 [Vincenzo Selvaggio] removed version, by default is set to 4.2 (latest from jpmml) removed copyright
ae8b993 [Vincenzo Selvaggio] updated some commented tests to use the new ModelExporter object reordered the imports
df8a89e [Vincenzo Selvaggio] added pmml version to pmml model changed the copyright to spark
a1b4dc3 [Vincenzo Selvaggio] updated imports
834ca44 [Vincenzo Selvaggio] reordered the import accordingly to the guidelines
349a76b [Vincenzo Selvaggio] new helper object to serialize the models to pmml format
c3ef9b8 [Vincenzo Selvaggio] set it to private
6357b98 [Vincenzo Selvaggio] set it to private
e1eb251 [Vincenzo Selvaggio] removed serialization part, this will be part of the ModelExporter helper object
aba5ee1 [Vincenzo Selvaggio] fixed cluster export
cd6c07c [Vincenzo Selvaggio] fixed scala style to run tests
f75b988 [Vincenzo Selvaggio] Merge remote-tracking branch 'origin/master' into mllib_pmml_model_export_SPARK-1406
07a29bf [selvinsource] Update LICENSE
8841439 [Vincenzo Selvaggio] adjust scala style in order to compile
1433b11 [Vincenzo Selvaggio] complete suite tests
8e71b8d [Vincenzo Selvaggio] kmeans pmml export implementation
9bc494f [Vincenzo Selvaggio] added scala suite tests added saveLocalFile to ModelExport trait
226e184 [Vincenzo Selvaggio] added javadoc and export model type in case there is a need to support other types of export (not just PMML)
a0e3679 [Vincenzo Selvaggio] export and pmml export traits kmeans test implementation
2015-04-29 23:21:21 -07:00
DB Tsai ba49eb1625 Some code clean up.
Author: DB Tsai <dbt@netflix.com>

Closes #5794 from dbtsai/clean and squashes the following commits:

ad639dd [DB Tsai] Indentation
834d527 [DB Tsai] Some code clean up.
2015-04-29 21:44:41 -07:00
Joseph K. Bradley 114bad606e [SPARK-7176] [ML] Add validation functionality to Param
Main change: Added isValid field to Param.  Modified all usages to use isValid when relevant.  Added helper methods in ParamValidate.

Also overrode Params.validate() in:
* CrossValidator + model
* Pipeline + model

I made a few updates for the elastic net patch:
* I changed "tol" to "convergenceTol"
* I added some documentation

This PR is Scala + Java only.  Python will be in a follow-up PR.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5740 from jkbradley/enforce-validate and squashes the following commits:

ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg.  Fixed test failures.  Renamed ParamValidate to ParamValidators.  Removed explicit type from ParamValidators calls where possible.
bb2665a [Joseph K. Bradley] merged with elastic net pr
ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
6895dfc [Joseph K. Bradley] small cleanups
069ac6d [Joseph K. Bradley] many cleanups
928fb84 [Joseph K. Bradley] Maybe done
a910ac7 [Joseph K. Bradley] still workin
6d60e2e [Joseph K. Bradley] Still workin
b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
dbc9fb2 [Joseph K. Bradley] merged with master.  enforcing Params.validate
2015-04-29 17:26:46 -07:00
Joseph K. Bradley b1ef6a60ff [SPARK-7259] [ML] VectorIndexer: do not copy non-ML metadata to output column
Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was the only use.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5789 from jkbradley/vector-indexer-metadata and squashes the following commits:

b28e159 [Joseph K. Bradley] Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was the only use.
2015-04-29 16:35:17 -07:00
Xusen Yin c9d530e2e5 [SPARK-6529] [ML] Add Word2Vec transformer
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).

There are some notes:

1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5596 from yinxusen/SPARK-6529 and squashes the following commits:

ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with #5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with #5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
2015-04-29 14:55:32 -07:00
DB Tsai 15995c883a [SPARK-7222] [ML] Added mathematical derivation in comment and compressed the model, removed the correction terms in LinearRegression with ElasticNet
Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Refactored the code so the model is compressed based on the storage. We may try compression based on the prediction time.

Also, I found that diffSum will be always zero mathematically, so no corrections are required.

Author: DB Tsai <dbt@netflix.com>

Closes #5767 from dbtsai/lir-doc and squashes the following commits:

5e346c9 [DB Tsai] refactoring
fc9f582 [DB Tsai] doc
58456d8 [DB Tsai] address feedback
69757b8 [DB Tsai] actually diffSum is mathematically zero! No correction is needed.
5929e49 [DB Tsai] typo
63f7d1e [DB Tsai] Added compression to the model based on storage
203a295 [DB Tsai] Add more documentation to LinearRegression in new ML framework.
2015-04-29 14:53:37 -07:00
Xusen Yin c0c0ba6d2a Fix a typo of "threshold"
mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5769 from yinxusen/patch-1 and squashes the following commits:

43235f4 [Xusen Yin] Update PearsonCorrelation.scala
f7287ee [Xusen Yin] Fix a typo of "threshold"
2015-04-29 10:13:48 -07:00
Xiangrui Meng 5ef006fc4d [SPARK-6756] [MLLIB] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
Add `compressed` to `Vector` with some other methods: `numActives`, `numNonzeros`, `toSparse`, and `toDense`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5756 from mengxr/SPARK-6756 and squashes the following commits:

8d4ecbd [Xiangrui Meng] address comment and add mima excludes
da54179 [Xiangrui Meng] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
2015-04-28 21:49:53 -07:00
Xiangrui Meng d36e67350c [SPARK-6965] [MLLIB] StringIndexer handles numeric input.
Cast numeric types to String for indexing. Boolean type is not handled in this PR. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5753 from mengxr/SPARK-6965 and squashes the following commits:

2e34f3c [Xiangrui Meng] add actual type in the error message
ad938bf [Xiangrui Meng] StringIndexer handles numeric input.
2015-04-28 17:41:09 -07:00
Xiangrui Meng f0a1f90f53 [SPARK-7201] [MLLIB] move Identifiable to ml.util
It shouldn't live directly under `spark.ml`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #5749 from mengxr/SPARK-7201 and squashes the following commits:

53847f9 [Xiangrui Meng] move Identifiable to ml.util
2015-04-28 14:07:26 -07:00