The MLlib ChiSqSelector class is not serializable, and so the example in the ChiSqSelector documentation fails. Also, that example is missing the import of ChiSqSelector.
This PR makes ChiSqSelector extend Serializable in MLlib, and adds the ChiSqSelector import statement to the associated example in the documentation.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes#6462 from dusenberrymw/Make_ChiSqSelector_Serializable_and_Fix_Related_Docs_Example and squashes the following commits:
9cb2f94 [Mike Dusenberry] Make MLlib ChiSqSelector Serializable.
d9003bf [Mike Dusenberry] Add missing import in MLlib ChiSqSelector Docs Scala example.
(cherry picked from commit 1281a35188)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: Octavian Geagla <ogeagla@gmail.com>
Closes#6008 from ogeagla/elementwise-prod-doc and squashes the following commits:
72e6dc0 [Octavian Geagla] [SPARK-7459] [MLLIB] Java example import.
cf2afbd [Octavian Geagla] [SPARK-7459] [MLLIB] Update description of example.
b66431b [Octavian Geagla] [SPARK-7459] [MLLIB] Add override annotation to java example, make scala example use same data as java.
6b26b03 [Octavian Geagla] [SPARK-7459] [MLLIB] Fix line which is too long.
79af020 [Octavian Geagla] [SPARK-7459] [MLLIB] Actually don't use Java 8.
9d5b31a [Octavian Geagla] [SPARK-7459] [MLLIB] Don't use Java 8
4f0c92f [Octavian Geagla] [SPARK-7459] [MLLIB] ElementwiseProduct Java example.
(cherry picked from commit e3a4374833)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.
Example of usage:
```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.PCA
val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val pca = PCA.create(training.first().features.size/2, data.map(_.features))
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
val valuesAndPreds = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
val score = model_pca.predict(point.features)
(score, point.label)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```
Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4304 from catap/pca and squashes the following commits:
501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit(). In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
(cherry picked from commit 8c07c75c98)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity
CC: avulanov I noticed a typo post-hoc and ended up making a few small edits. Do the changes look OK?
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4732 from jkbradley/chisqselector-docs and squashes the following commits:
9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
Added description of ChiSqSelector and few words about feature selection in general. I could add a code example, however it would not look reasonable in the absence of feature discretizer or a dataset in the `data` folder that has redundant features.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#4709 from avulanov/SPARK-5912 and squashes the following commits:
19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs
This seems complete, the duplication of tests for provided means/variances might be overkill, would appreciate some feedback.
Author: Octavian Geagla <ogeagla@gmail.com>
Closes#4140 from ogeagla/SPARK-5207 and squashes the following commits:
fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
Python example for word2vec
mengxr
Author: Anant <anant.asty@gmail.com>
Closes#2952 from anantasty/SPARK-3838 and squashes the following commits:
87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
Thare are some inconsistent spellings 'MLlib' and 'MLLib' in some documents and source codes.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2903 from sarutak/SPARK-4055 and squashes the following commits:
b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatibility with the original IDF API.
* Sets the IDFs to 0 for terms which DFs are less than minimumOccurence
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
* Updated the MLLib Feature Extraction programming guide to describe the new feature
Author: RJ Nowling <rnowling@gmail.com>
Closes#2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
Author: RJ Nowling <rnowling@gmail.com>
Closes#2459 from rnowling/tfidf-fix and squashes the following commits:
b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#2068 from dbtsai/transformer-documentation and squashes the following commits:
109f324 [DB Tsai] address feedback
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes#2061 from mengxr/tfidf-doc and squashes the following commits:
ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
Author: Ameet Talwalkar <atalwalkar@gmail.com>
Closes#1908 from atalwalkar/master and squashes the following commits:
fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation