Commit graph

1010 commits

Author SHA1 Message Date
Xiangrui Meng f0f563a3c4 [SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initialization
* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: yu-iskw HuJiayin

Author: Xiangrui Meng <meng@databricks.com>

Closes #8526 from mengxr/SPARK-10354.
2015-08-30 23:20:03 -07:00
Burak Yavuz 8d2ab75d3b [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications
mengxr jkbradley rxin

It would be great if this fix made it into RC3!
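
For context, the general matrix multiply (gemm) contract is C := alpha * A * B + beta * C, so with beta = 0.0 the previous contents of C must not influence the result. The following is a minimal plain-Java sketch of that contract, not the code touched by this PR; the class name `GemmBetaSketch` and the array-of-arrays layout are illustrative assumptions.

```java
import java.util.Arrays;

class GemmBetaSketch {
    // Hypothetical helper: computes C := alpha * A * B + beta * C in place.
    // a is m x k, b is k x n, c is m x n (all row-major arrays of arrays).
    static void gemm(double alpha, double[][] a, double[][] b, double beta, double[][] c) {
        int m = a.length;
        int k = b.length;
        int n = b[0].length;
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                // With beta == 0 the old entry is discarded outright, so stale
                // values (even NaN) cannot leak into the result.
                double scaled = (beta == 0.0) ? 0.0 : beta * c[i][j];
                c[i][j] = alpha * sum + scaled;
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] identity = {{1, 0}, {0, 1}};
        double[][] c = {{100, Double.NaN}, {7, 7}};
        gemm(1.0, a, identity, 0.0, c);
        System.out.println(Arrays.deepToString(c));
    }
}
```

Handling beta = 0.0 as an explicit overwrite, rather than multiplying by zero, is what keeps a NaN in the old C from propagating.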

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8525 from brkyvz/blas-scaling.
2015-08-30 12:21:15 -07:00
Yu ISHIKAWA 4eeda8d472 [SPARK-10260] [ML] Add @Since annotation to ml.clustering
### JIRA
[[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8455 from yu-iskw/SPARK-10260.
2015-08-28 00:50:26 -07:00
Feynman Liang 5bfe9e1111 [SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java compatibility test
* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249

Author: Feynman Liang <fliang@databricks.com>

Closes #8436 from feynmanliang/SPARK-10230.
2015-08-27 16:10:37 -07:00
Vyacheslav Baranov fdd466bed7 [SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data
`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.

The proposed solution is to always cache the dataset and remove the warning. There's a caveat, though: the input dataset gets evaluated twice, first in line 270 when fitting `StandardScaler` and then again when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better.

Author: Vyacheslav Baranov <slavik.baranov@gmail.com>

Closes #8395 from SlavikBaranov/SPARK-10182.
2015-08-27 18:56:18 +01:00
Feynman Liang e1f4de4a7d [SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Uses `commons.lang.StringUtils` in place of `com.google.collections.Strings`
* Uses the `List` interface in place of `ArrayList` implementations

This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
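
As a small illustration of the first replacement, the sketch below builds a test fixture with `java.util.Arrays.asList` only. The class and method names are hypothetical and are not taken from the PR. One difference to keep in mind is that `Arrays.asList` returns a fixed-size list, so a test that needs to grow the list has to copy it first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ArraysAsListSketch {
    // Fixed-size list backed by the varargs array; enough for read-only fixtures.
    static List<String> fixture() {
        return Arrays.asList("a", "b", "c");
    }

    // Copy into an ArrayList when the test needs to add or remove elements.
    static List<String> growableFixture() {
        return new ArrayList<>(fixture());
    }

    public static void main(String[] args) {
        List<String> grown = growableFixture();
        grown.add("d");
        System.out.println(fixture() + " " + grown);
    }
}
```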

Author: Feynman Liang <fliang@databricks.com>

Closes #8451 from feynmanliang/SPARK-10257.
2015-08-27 18:46:41 +01:00
Jacek Laskowski b02e818722 [SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) having been removed in 2.11.7, which makes the build fail with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases.

Build for 2.10:

    ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install

and 2.11:

    ./dev/change-scala-version.sh 2.11
    ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install

Author: Jacek Laskowski <jacek@japila.pl>

Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
2015-08-27 11:07:37 +01:00
Feynman Liang 1a446f75b6 [SPARK-10256] [ML] Removes guava dependency from spark.ml.classification JavaTests
Author: Feynman Liang <fliang@databricks.com>

Closes #8447 from feynmanliang/SPARK-10256.
2015-08-27 10:46:18 +01:00
Feynman Liang 75d6230794 [SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests
Author: Feynman Liang <fliang@databricks.com>

Closes #8446 from feynmanliang/SPARK-10255.
2015-08-27 10:45:35 +01:00
Feynman Liang 1650f6f56e [SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`

Author: Feynman Liang <fliang@databricks.com>

Closes #8445 from feynmanliang/SPARK-10254.
2015-08-27 10:44:44 +01:00
Xiangrui Meng 086d4681df [SPARK-10241] [MLLIB] update since versions in mllib.recommendation
Same as #8421 but for `mllib.recommendation`.

cc srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8432 from mengxr/SPARK-10241.
2015-08-26 14:02:19 -07:00
Xiangrui Meng 6519fd06cc [SPARK-9665] [MLLIB] audit MLlib API annotations
I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs.

cc jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #8452 from mengxr/SPARK-9665.
2015-08-26 11:47:05 -07:00
Xiangrui Meng 321d775969 [SPARK-10236] [MLLIB] update since versions in mllib.feature
Same as #8421 but for `mllib.feature`.

cc dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:

0e8d658 [Xiangrui Meng] remove unnecessary comment
ad70b03 [Xiangrui Meng] update since versions in mllib.feature
2015-08-25 23:45:41 -07:00
Xiangrui Meng 4657fa1f37 [SPARK-10235] [MLLIB] update since versions in mllib.regression
Same as #8421 but for `mllib.regression`.

cc freeman-lab dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:

6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
2015-08-25 22:49:33 -07:00
Xiangrui Meng fb7e12fe2e [SPARK-10243] [MLLIB] update since versions in mllib.tree
Same as #8421 but for `mllib.tree`.

cc jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #8442 from mengxr/SPARK-10236.
2015-08-25 22:35:49 -07:00
Xiangrui Meng d703372f86 [SPARK-10234] [MLLIB] update since version in mllib.clustering
Same as #8421 but for `mllib.clustering`.

cc feynmanliang yu-iskw

Author: Xiangrui Meng <meng@databricks.com>

Closes #8435 from mengxr/SPARK-10234.
2015-08-25 22:33:48 -07:00
Xiangrui Meng c3a54843c0 [SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat
The same as #8421 but for `mllib.stat` and `mllib.random`.

cc feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8439 from mengxr/SPARK-10242.
2015-08-25 22:31:23 -07:00
Xiangrui Meng ab431f8a97 [SPARK-10238] [MLLIB] update since versions in mllib.linalg
Same as #8421 but for `mllib.linalg`.

cc dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:

b38437e [Xiangrui Meng] update since versions in mllib.linalg
2015-08-25 20:07:56 -07:00
Xiangrui Meng 8668ead2e7 [SPARK-10233] [MLLIB] update since version in mllib.evaluation
Same as #8421 but for `mllib.evaluation`.

cc avulanov

Author: Xiangrui Meng <meng@databricks.com>

Closes #8423 from mengxr/SPARK-10233.
2015-08-25 18:17:54 -07:00
Feynman Liang 125205cdb3 [SPARK-9888] [MLLIB] User guide for new LDA features
* Adds two new sections to LDA's user guide; one for each optimizer/model
 * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
 * Cleans up a TODO and sets a default parameter in LDA code

jkbradley hhbyyh

Author: Feynman Liang <fliang@databricks.com>

Closes #8254 from feynmanliang/SPARK-9888.
2015-08-25 17:39:20 -07:00
Xiangrui Meng 00ae4be97f [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util
Same as #8421 but for `mllib.pmml` and `mllib.util`.

cc dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:

a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
2015-08-25 14:11:38 -07:00
Feynman Liang 9205907876 [SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value
Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
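
To make the role of the tolerance concrete, here is a minimal sketch of a relative-change stopping rule in plain Java. The 0.001 default comes from the message above; the exact criterion (norm of the weight change compared against the tolerance times the larger of the current norm and 1.0) and all names are assumptions for illustration and may not match the library's check.

```java
class ConvergenceSketch {
    // Default tolerance quoted in the commit message.
    static final double DEFAULT_TOL = 0.001;

    // Hypothetical check: stop when the weights moved by less than
    // tol * max(||current||, 1.0) between two consecutive iterations.
    static boolean isConverged(double[] previous, double[] current, double tol) {
        double diffSq = 0.0;
        double currSq = 0.0;
        for (int i = 0; i < current.length; i++) {
            double d = previous[i] - current[i];
            diffSq += d * d;
            currSq += current[i] * current[i];
        }
        return Math.sqrt(diffSq) < tol * Math.max(Math.sqrt(currSq), 1.0);
    }

    public static void main(String[] args) {
        double[] prev = {1.0, 1.0};
        double[] curr = {1.0, 1.0005};
        System.out.println(isConverged(prev, curr, DEFAULT_TOL));
    }
}
```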

Author: Feynman Liang <fliang@databricks.com>

Closes #8424 from feynmanliang/SPARK-9797.
2015-08-25 13:23:15 -07:00
Xiangrui Meng c619c7552f [SPARK-10237] [MLLIB] update since versions in mllib.fpm
Same as #8421 but for `mllib.fpm`.

cc feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8429 from mengxr/SPARK-10237.
2015-08-25 13:22:38 -07:00
Feynman Liang c0e9ff1588 [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias
* Adds doc for alias of runMiniBatchSGD documenting default value for convergenceTol
* Cleans up a note in code

Author: Feynman Liang <fliang@databricks.com>

Closes #8425 from feynmanliang/SPARK-9800.
2015-08-25 13:21:05 -07:00
Xiangrui Meng 16a2be1a84 [SPARK-10231] [MLLIB] update @Since annotation for mllib.classification
Update `Since` annotation in `mllib.classification`:

1. add version to classes, objects, constructors, and public variables declared in constructors
2. correct some versions
3. remove `Since` on `toString`

MechCoder dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:

b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
2015-08-25 12:16:23 -07:00
Feynman Liang 881208a8e8 [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration
See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770)

CC jkbradley

Author: Feynman Liang <fliang@databricks.com>

Closes #8422 from feynmanliang/SPARK-10230.
2015-08-25 11:58:47 -07:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
Joseph K. Bradley b963c19a80 [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.

This PR adds a unit test which checks this.  It failed previously but works with this fix.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8370 from jkbradley/gmm-fix.
2015-08-23 18:34:07 -07:00
Xusen Yin 630a994e6a [SPARK-9893] User guide with Java test suite for VectorSlicer
Add a user guide for `VectorSlicer`, along with a Java test suite and a Python version of `VectorSlicer`.

Note that the Python version does not yet support selecting by name.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #8267 from yinxusen/SPARK-9893.
2015-08-21 16:30:12 -07:00
Joseph K. Bradley f01c4220d2 [SPARK-10163] [ML] Allow single-category features for GBT models
Removed categorical feature info validation since it is no longer needed

This is needed to make the ML user guide examples work (in another current PR).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8367 from jkbradley/gbt-single-cat.
2015-08-21 16:28:00 -07:00
MechCoder f5b028ed2f [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8352 from MechCoder/since.
2015-08-21 14:19:24 -07:00
Joseph K. Bradley eaafe139f8 [SPARK-9245] [MLLIB] LDA topic assignments
For each (document, term) pair, return top topic.  Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
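
As a rough illustration of "top topic per (document, term) pair", the sketch below picks the topic that maximizes the product of a document-topic weight and a topic-term weight. That scoring rule, the class name, and the array layout are assumptions made for illustration; the PR's actual computation may differ.

```java
class TopTopicSketch {
    // docTopic[k]: weight of topic k in the document (assumed input).
    // topicTerm[k]: weight of the term within topic k (assumed input).
    // Returns the index of the topic with the highest combined score.
    static int topTopic(double[] docTopic, double[] topicTerm) {
        int best = 0;
        double bestScore = docTopic[0] * topicTerm[0];
        for (int k = 1; k < docTopic.length; k++) {
            double score = docTopic[k] * topicTerm[k];
            if (score > bestScore) {
                best = k;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(topTopic(new double[]{0.7, 0.3}, new double[]{0.1, 0.5}));
    }
}
```

Because the result depends only on the document and the term, every occurrence of the same term in a document receives the same assignment, which matches the per-document-term estimate described above.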

CC: rotationsymmetry mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8329 from jkbradley/lda-topic-assignments.
2015-08-20 15:01:31 -07:00
MechCoder 7cfc0750e1 [SPARK-10108] Add since tags to mllib.feature
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8309 from MechCoder/tags_feature.
2015-08-20 14:56:08 -07:00
Xiangrui Meng 2a3d98aae2 [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite
Otherwise, setters do not return self type. jkbradley avulanov

Author: Xiangrui Meng <meng@databricks.com>

Closes #8342 from mengxr/SPARK-10138.
2015-08-20 14:47:04 -07:00
Eric Liang 8e0a072f78 [SPARK-9895] User Guide for RFormula Feature Transformer
mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8293 from ericl/docs-2.
2015-08-19 15:43:08 -07:00
Xiangrui Meng 5b62bef8cb [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see 72fdeb6463). MechCoder

Closes #8256

Author: Xiangrui Meng <meng@databricks.com>
Author: Xiaoqing Wang <spark445@126.com>
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8288 from mengxr/SPARK-8918.
2015-08-19 13:17:26 -07:00
Feynman Liang 28a98464ea [SPARK-10097] Adds shouldMaximize flag to ml.evaluation.Evaluator
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.

This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
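
A minimal sketch of how a caller might use such a flag when choosing among candidate models is shown below. Only the name `isLargerBetter` comes from the message above; the class, the method, and the plain array of metric values are hypothetical stand-ins, not the actual `CrossValidator` code.

```java
class MetricSelectionSketch {
    // Returns the index of the best metric: the largest value when
    // isLargerBetter is true, the smallest otherwise. No sign flipping
    // is needed, so reported metric values keep their natural sign.
    static int bestIndex(double[] metrics, boolean isLargerBetter) {
        int best = 0;
        for (int i = 1; i < metrics.length; i++) {
            boolean better = isLargerBetter
                ? metrics[i] > metrics[best]
                : metrics[i] < metrics[best];
            if (better) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] accuracy = {0.5, 0.9, 0.7};
        double[] rmse = {2.0, 0.5, 1.0};
        System.out.println(bestIndex(accuracy, true) + " " + bestIndex(rmse, false));
    }
}
```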

CC jkbradley

Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8290 from feynmanliang/SPARK-10097.
2015-08-19 11:35:05 -07:00
lewuathe c635a16f64 [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt
Currently there is no test case for `Params#arrayLengthGt`.

Author: lewuathe <lewuathe@me.com>

Closes #8223 from Lewuathe/SPARK-10012.
2015-08-18 15:30:23 -07:00
Bryan Cutler 1dbffba37a [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Added since tags to mllib.tree

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
2015-08-18 14:58:30 -07:00
Feynman Liang f5ea391290 [SPARK-9900] [MLLIB] User guide for Association Rules
Updates FPM user guide to include Association Rules.

Author: Feynman Liang <fliang@databricks.com>

Closes #8207 from feynmanliang/SPARK-9900-arules.
2015-08-18 12:53:57 -07:00
Yuhao Yang 354f4582b6 [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel
jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.
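
The idea of building a vocabulary from corpus-wide term frequency can be sketched in plain Java as follows. This is only an illustration under stated assumptions: here `minCount` is compared against the total number of occurrences of a term across all documents, and the class and method names are hypothetical. The estimator's actual semantics may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VocabularySketch {
    // Counts every term across the whole corpus, drops terms seen fewer than
    // minCount times, and orders the rest by descending count (ties by term).
    static List<String> buildVocabulary(List<List<String>> docs, long minCount) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        List<String> vocab = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                vocab.add(e.getKey());
            }
        }
        vocab.sort((x, y) -> {
            int byCount = Long.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return vocab;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c"),
            Arrays.asList("a"));
        System.out.println(buildVocabulary(docs, 2L));
    }
}
```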

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7388 from hhbyyh/cvEstimator.
2015-08-18 11:00:09 -07:00
Yanbo Liang dd0614fd61 [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8263 from yanboliang/mlp-public.
2015-08-17 23:57:02 -07:00
Xiangrui Meng e290029a35 [SPARK-7808] [ML] add package doc for ml.feature
This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #8260 from mengxr/SPARK-7808.
2015-08-17 19:40:51 -07:00
Prayag Chandran 18523c1305 SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
Added since tags to mllib.regression

Author: Prayag Chandran <prayagchandran@gmail.com>

Closes #7518 from prayagchandran/sinceTags and squashes the following commits:

fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformatting and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
2015-08-17 17:26:08 -07:00
Sameer Abhyankar 088b11ec59 [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome>
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local>

Closes #7729 from sabhyankar/branch_8920.
2015-08-17 16:00:23 -07:00
Feynman Liang f7efda3975 [SPARK-9959] [MLLIB] Association Rules Java Compatibility
mengxr

Author: Feynman Liang <fliang@databricks.com>

Closes #8206 from feynmanliang/SPARK-9959-arules-java.
2015-08-17 09:58:34 -07:00
Davies Liu 37586e5449 [HOTFIX] fix duplicated braces
Author: Davies Liu <davies@databricks.com>

Closes #8219 from davies/fix_typo.
2015-08-14 20:56:55 -07:00
Joseph K. Bradley 2a6590e510 [SPARK-9981] [ML] Made labels public for StringIndexerModel
Also added unit test for integration between StringIndexerModel and IndexToString

CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back.  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8211 from jkbradley/stridx-labels.
2015-08-14 14:05:03 -07:00
Wenchen Fan 34d610be85 [SPARK-9929] [SQL] support metadata in withColumn
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow the user to set metadata, just as we did for `Column.as`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8159 from cloud-fan/withColumn.
2015-08-14 12:00:01 -07:00
Holden Karau a7317ccdc2 [SPARK-8744] [ML] Add a public constructor to StringIndexer
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
2015-08-14 11:22:10 -07:00