* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save an object per instance
Further improvements will be addressed in SPARK-10329
cc: yu-iskw HuJiayin
Author: Xiangrui Meng <meng@databricks.com>
Closes #8526 from mengxr/SPARK-10354.
* Adds user guide for `ml.feature.StopWordsRemover`; the code examples were run locally
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow-up: the Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes #8436 from feynmanliang/SPARK-10230.
`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.
The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, first at line 270 when fitting `StandardScaler` and then again when running the optimizer. So it might be worth restoring the removed warning.
Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes #8395 from SlavikBaranov/SPARK-10182.
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Uses `commons.lang.StringUtils` instead of `com.google.collections.Strings`
* Uses the `List` interface instead of `ArrayList` implementations
This PR, along with #8445, #8446, and #8447, completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
Author: Feynman Liang <fliang@databricks.com>
Closes #8451 from feynmanliang/SPARK-10257.
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in Scala 2.11.7, which causes the build to fail with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles: BUILD SUCCESS in both cases.
Build for 2.10:
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
and 2.11:
./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
Author: Jacek Laskowski <jacek@japila.pl>
Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`
Author: Feynman Liang <fliang@databricks.com>
Closes #8445 from feynmanliang/SPARK-10254.
I found only `ml.NaiveBayes` missing the `Experimental` annotation. This PR doesn't cover Python APIs.
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8452 from mengxr/SPARK-9665.
Same as #8421 but for `mllib.feature`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:
0e8d658 [Xiangrui Meng] remove unnecessary comment
ad70b03 [Xiangrui Meng] update since versions in mllib.feature
Same as #8421 but for `mllib.regression`.
cc freeman-lab dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:
6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
The same as #8241 but for `mllib.stat` and `mllib.random`.
cc feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8439 from mengxr/SPARK-10242.
Same as #8421 but for `mllib.linalg`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
b38437e [Xiangrui Meng] update since versions in mllib.linalg
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes #8254 from feynmanliang/SPARK-9888.
Same as #8421 but for `mllib.pmml` and `mllib.util`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
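For readers unfamiliar with the parameter, here is a minimal plain-Python sketch of how a convergence tolerance is commonly used to stop gradient descent early. Only the default value 0.001 comes from this commit message; the function names (`is_converged`, `minimize`) and the relative-change criterion are illustrative assumptions, not a copy of the `GradientDescent` implementation.

```python
import math

# Default named in the commit message; everything else here is illustrative.
DEFAULT_CONVERGENCE_TOL = 0.001


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def is_converged(previous, current, convergence_tol=DEFAULT_CONVERGENCE_TOL):
    """Assumed criterion: the change in weights is small relative to their size."""
    diff = l2_norm([c - p for p, c in zip(previous, current)])
    return diff < convergence_tol * max(l2_norm(current), 1.0)


def minimize(gradient, initial, step_size=0.1, max_iterations=1000,
             convergence_tol=DEFAULT_CONVERGENCE_TOL):
    """Plain gradient descent that stops once successive weights barely move."""
    weights = list(initial)
    for iteration in range(1, max_iterations + 1):
        grad = gradient(weights)
        updated = [w - step_size * g for w, g in zip(weights, grad)]
        if is_converged(weights, updated, convergence_tol):
            return updated, iteration
        weights = updated
    return weights, max_iterations


# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
solution, iterations = minimize(lambda w: [2.0 * (w[0] - 3.0)], [0.0])
```

A smaller tolerance trades more iterations for a solution closer to the optimum; a tolerance of 0 effectively disables early stopping in this sketch.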
Author: Feynman Liang <fliang@databricks.com>
Closes #8424 from feynmanliang/SPARK-9797.
* Adds doc for the alias of `runMiniBatchSGD`, documenting the default value for `convergenceTol`
* Cleans up a note in code
Author: Feynman Liang <fliang@databricks.com>
Closes #8425 from feynmanliang/SPARK-9800.
Update `Since` annotation in `mllib.classification`:
1. add version to classes, objects, constructors, and public variables declared in constructors
2. correct some versions
3. remove `Since` on `toString`
MechCoder dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.
This PR adds a unit test which checks this. It failed previously but works with this fix.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8370 from jkbradley/gmm-fix.
Add user guide for `VectorSlicer`, along with a Java test suite and a Python version of `VectorSlicer`.
Note that the Python version does not currently support selecting by names.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #8267 from yinxusen/SPARK-9893.
Removed categorical feature info validation since it is no longer needed.
This is needed to make the ML user guide examples work (in another current PR).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8367 from jkbradley/gbt-single-cat.
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
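As a rough illustration of "one estimate per document-term", the sketch below picks the top topic for each distinct (doc, term) pair in plain Python. The scoring rule (document-topic weight times topic-term weight) and all names (`top_topic_assignments`, `doc_topic`, `topic_term`) are assumptions made for this example; the actual computation in the PR may differ.

```python
def top_topic_assignments(doc_topic, topic_term, doc_terms):
    """Return {(doc_id, term): top_topic_index}.

    doc_topic:  {doc_id: [weight of topic k in the document]}
    topic_term: [ {term: weight of term in topic k} for each topic k ]
    doc_terms:  {doc_id: [tokens]} (repeated tokens are allowed)
    """
    assignments = {}
    for doc_id, tokens in doc_terms.items():
        theta = doc_topic[doc_id]
        # Tokens of the same term are exchangeable, so score each distinct
        # term once per document instead of once per token.
        for term in sorted(set(tokens)):
            scores = [theta[k] * topic_term[k].get(term, 0.0)
                      for k in range(len(theta))]
            assignments[(doc_id, term)] = max(range(len(scores)),
                                              key=scores.__getitem__)
    return assignments


# Usage with two topics and two documents (toy numbers, not real results).
doc_topic = {0: [0.9, 0.1], 1: [0.2, 0.8]}
topic_term = [{"cat": 0.7, "dog": 0.3}, {"cat": 0.1, "dog": 0.9}]
doc_terms = {0: ["cat", "cat", "dog"], 1: ["dog", "cat"]}
assignments = top_topic_assignments(doc_topic, topic_term, doc_terms)
```

Note that document 0 contains the token "cat" twice but receives a single assignment for it.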
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8329 from jkbradley/lda-topic-assignments.
This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see 72fdeb6463). MechCoder
Closes #8256
Author: Xiangrui Meng <meng@databricks.com>
Author: Xiaoqing Wang <spark445@126.com>
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8288 from mengxr/SPARK-8918.
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
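The idea can be sketched in plain Python as follows. This is a toy model of the design, not the Spark API: the class names, the `select_best` helper, and the (prediction, label) pair representation are all hypothetical.

```python
import math


class Evaluator:
    """Toy stand-in for an evaluator base class (hypothetical, not Spark's)."""

    def evaluate(self, pairs):
        raise NotImplementedError

    def is_larger_better(self):
        # Default: the metric should be maximized.
        return True


class RmseEvaluator(Evaluator):
    def evaluate(self, pairs):
        return math.sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))

    def is_larger_better(self):
        # RMSE is an error, so smaller is better; no sign flipping required.
        return False


class AccuracyEvaluator(Evaluator):
    def evaluate(self, pairs):
        return sum(1 for p, y in pairs if p == y) / len(pairs)


def select_best(evaluator, candidates):
    """Pick the candidate whose metric is best according to the evaluator."""
    metrics = {name: evaluator.evaluate(pairs)
               for name, pairs in candidates.items()}
    choose = max if evaluator.is_larger_better() else min
    best = choose(metrics, key=metrics.get)
    return best, metrics[best]


# Usage: each candidate maps to (prediction, label) pairs.
regression = {"a": [(1.0, 1.0), (2.0, 2.5)], "b": [(1.0, 3.0), (2.0, 2.0)]}
best_name, best_rmse = select_best(RmseEvaluator(), regression)
```

Because the direction lives on the evaluator, the reported metric keeps its natural sign (here a non-negative RMSE) instead of being negated to fit a maximize-only search.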
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8290 from feynmanliang/SPARK-10097.
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of `minCount` to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
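A minimal plain-Python sketch of vocabulary extraction with a corpus-wide filter is shown below. It assumes `minCount` means the total number of occurrences of a term across all documents (the Word2Vec-style reading); if the PR instead counts documents containing the term, the counting line would change. The function names and the `vocab_size` cap are hypothetical.

```python
from collections import Counter


def fit_vocabulary(documents, min_count=1, vocab_size=None):
    """Build a vocabulary from tokenized documents.

    Assumption: min_count filters on the total count of a term across the
    whole corpus, not on its count within a single document.
    """
    counts = Counter(token for doc in documents for token in doc)
    kept = [(term, n) for term, n in counts.items() if n >= min_count]
    # Most frequent terms first; ties broken alphabetically for determinism.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return [term for term, _ in kept]


def transform(vocabulary, document):
    """Turn one tokenized document into a dense term-count vector."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in document:
        if token in index:
            vector[index[token]] += 1
    return vector


# Usage: "c" appears only once in the corpus, so min_count=2 drops it.
corpus = [["a", "b", "a"], ["b", "c"], ["a", "b"]]
vocabulary = fit_vocabulary(corpus, min_count=2)
vector = transform(vocabulary, ["a", "c", "a"])
```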
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
Make `layers` and `weights` public variables of `MultilayerPerceptronClassificationModel`. Currently, users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8263 from yanboliang/mlp-public.
This PR adds a short description of the `ml.feature` package with a code example. The Java package doc will come in a separate PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8260 from mengxr/SPARK-7808.
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
Also added unit test for integration between StringIndexerModel and IndexToString
CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8211 from jkbradley/stridx-labels.
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just as we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8159 from cloud-fan/withColumn.
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
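To make the motivation concrete, here is a small plain-Python sketch of an indexer model that can be built either by fitting on data or directly from a pre-computed label list. The class `LabelIndexerModel` and its methods are hypothetical illustrations, not the Spark API.

```python
from collections import Counter


class LabelIndexerModel:
    """Toy label-to-index model (hypothetical; not the Spark class)."""

    def __init__(self, labels):
        # Public constructor: callers may pass a pre-computed, ordered label
        # list; the position of each label in the list is its index.
        labels = list(labels)
        if len(set(labels)) != len(labels):
            raise ValueError("labels must be unique")
        self.labels = labels
        self._index = {label: i for i, label in enumerate(labels)}

    @classmethod
    def fit(cls, values):
        """Derive labels from data, most frequent first (ties alphabetical)."""
        counts = Counter(values)
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return cls(ordered)

    def transform(self, values):
        return [self._index[v] for v in values]

    def inverse(self, indices):
        return [self.labels[i] for i in indices]


# Usage: skip fitting entirely when the index is already known.
model = LabelIndexerModel(["a", "b", "c"])
encoded = model.transform(["c", "a"])
decoded = model.inverse(encoded)
```

Building the model from known labels avoids a pass over the data and guarantees the same index assignment across runs.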
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.