cc marmbrus
Author: Volodymyr Lyubinets <vlyubin@gmail.com>
Closes#5279 from vlyubin/speedup and squashes the following commits:
e75a387 [Volodymyr Lyubinets] Changes to ScalaUDF
11a20ec [Volodymyr Lyubinets] Avoid creating a tuple
c327bc9 [Volodymyr Lyubinets] Moved the only remaining function from DataTypeConversions to DateUtils
dec6802 [Volodymyr Lyubinets] Addresed review feedback
74301fa [Volodymyr Lyubinets] Addressed review comments
afa3aa5 [Volodymyr Lyubinets] Minor refactoring, added license, removed debug output
881dc60 [Volodymyr Lyubinets] Moved to a separate module; addressed review comments; one extra place of usage; changed behaviour for Java
8cad6e2 [Volodymyr Lyubinets] Addressed review commments
41b2aa9 [Volodymyr Lyubinets] Creating converters for ScalaReflection stuff, and more
jira: https://issues.apache.org/jira/browse/SPARK-6693
It's kind of annoying when debugging and found you cannot print out the matrix as you want.
original toString of Matrix only print like following,
0.17810102596909183 0.5616906241468385 ... (10 total)
0.9692861997823815 0.015558159784155756 ...
0.8513015122819192 0.031523763918528847 ...
0.5396875653953941 0.3267864552779176 ...
The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5344 from hhbyyh/addToString and squashes the following commits:
19a6836 [Yuhao Yang] remove extra line
6314b21 [Yuhao Yang] add exclude
736c324 [Yuhao Yang] add ut and exclude
420da39 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into addToString
c22f352 [Yuhao Yang] style change
64a9e0f [Yuhao Yang] add specific to string to matrix
Support FPGrowth algorithm in Python API.
Should we remove "Experimental" which were marked for FPGrowth and FPGrowthModel in Scala? jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5213 from yanboliang/spark-6264 and squashes the following commits:
ed62ead [Yanbo Liang] trigger jenkins
8ce0359 [Yanbo Liang] fix docstring style
544c725 [Yanbo Liang] address comments
a2d7cf7 [Yanbo Liang] add doc for FPGrowth.train()
dcf7d73 [Yanbo Liang] add python doc
b18fd07 [Yanbo Liang] trigger jenkins
2c951b8 [Yanbo Liang] fix typos
7f62c8f [Yanbo Liang] add fpm to __init__.py
b96206a [Yanbo Liang] Support FPGrowth algorithm in Python API
https://issues.apache.org/jira/browse/SPARK-6758
I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes#5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:
e09605b [WangTaoTheTonic] block the right jetty package
So we can turn style checker on for test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#5411 from rxin/test-style-mllib and squashes the following commits:
d8a2569 [Reynold Xin] [SPARK-6765] Fix test code style for mllib.
I have the fit intercept enabled by default for logistic regression, I
wonder what others think here. I understand that it enables allocation
by default which is undesirable, but one needs to have a very strong
reason for not having an intercept term enabled so it is the safer
default from a statistical sense.
Explicitly modeling the intercept by adding a column of all 1s does not
work. I believe the reason is that since the API for
LogisticRegressionWithLBFGS forces column normalization, and a column of all
1s has 0 variance so dividing by 0 kills it.
Author: Omede Firouz <ofirouz@palantir.com>
Closes#5301 from oefirouz/addIntercept and squashes the following commits:
9f1286b [Omede Firouz] [SPARK-6705][MLLIB] Add fitInterceptTerm to LogisticRegression
1d6bd6f [Omede Firouz] [SPARK-6705][MLLIB] Add a fit intercept term to ML LogisticRegression
9963509 [Omede Firouz] [MLLIB] Add fitIntercept to LogisticRegression
2257fca [Omede Firouz] [MLLIB] Add fitIntercept param to logistic regression
329c1e2 [Omede Firouz] [MLLIB] Add fit intercept term
bd9663c [Omede Firouz] [MLLIB] Add fit intercept api to ml logisticregression
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5384 from vinodkc/Suppression_Scala_existential_code and squashes the following commits:
82a3a1f [Vinod K C] Added scala.language.existentials
Use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This could happen in a dense and large column - this way Spark can spill the pairs onto disk instead of building all the pairs before handing them to Spark.
Another PR coming to update documentation.
Author: Reza Zadeh <reza@databricks.com>
Closes#5364 from rezazadeh/optmemsim and squashes the following commits:
47c90ba [Reza Zadeh] Iterators in columnSimilarities for flatMap
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5296 from Lewuathe/SPARK-6615 and squashes the following commits:
f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
There's no corresponding printing in linear regression. Here was my previous PR (something weird happened and I can't reopen it) https://github.com/apache/spark/pull/5272
Author: Omede Firouz <ofirouz@palantir.com>
Closes#5338 from oefirouz/println and squashes the following commits:
3f3dbf4 [Omede Firouz] [MLLIB] Remove println
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.
Author: Reynold Xin <rxin@databricks.com>
Closes#5342 from rxin/SPARK-6428 and squashes the following commits:
7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
This patch fixes a reported bug causing model updates to not properly propagate to model predictions during streaming regression. These minor changes in model declaration fix the problem, and I expanded the tests to include the scenario in which the bug was arising. The two new tests failed prior to the patch and now pass.
cc mengxr
Author: freeman <the.freeman.lab@gmail.com>
Closes#5037 from freeman-lab/train-predict-fix and squashes the following commits:
3af953e [freeman] Expand test coverage to include combined training and prediction
8f84fc8 [freeman] Move model declaration
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#5318 from mengxr/SPARK-6660 and squashes the following commits:
0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5314 from mengxr/SPARK-6642 and squashes the following commits:
dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
Word2Vec model now supports saving and loading.
a] The Metadata stored in JSON format consists of "version", "classname", "vectorSize" and "numWords"
b] The data stored in Parquet file format consists of an Array of rows with each row consisting of 2 columns, first being the word: String and the second, an Array of Floats.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5291 from MechCoder/spark-5692 and squashes the following commits:
1142f3a [MechCoder] Add numWords to metaData
bfe4c39 [MechCoder] [SPARK-5692] Word2Vec save/load
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
```scala
LogisticRegressionWithLBFGS
setNumClasses
setValidateData
LogisticRegressionModel
getThreshold
numClasses
numFeatures
SVMWithSGD
setValidateData
SVMModel
getThreshold
```
For users the greatest benefit in this PR is multiclass classification was supported by Python API.
Users can train multiclass classification model and use it to predict in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5137 from yanboliang/spark-6255 and squashes the following commits:
0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli.
When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.
Default for model is original Multinomial fit and predict.
Added additional testing for Bernoulli and Multinomial models.
Author: leahmcguire <lmcguire@salesforce.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Leah McGuire <lmcguire@salesforce.com>
Closes#4087 from leahmcguire/master and squashes the following commits:
f3c8994 [leahmcguire] changed checks on model type to requires
acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
9ad89ca [Joseph K. Bradley] removed old code
6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added. Updated tests. Also updated ModelType enum-like type.
852a727 [leahmcguire] merged with upstream master
a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
bea62af [leahmcguire] put back in constructor for NaiveBayes
01baad7 [leahmcguire] made fixes from code review
fb0a5c7 [leahmcguire] removed typo
e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
2d0c1ba [leahmcguire] fixed typo in NaiveBayes
c298e78 [leahmcguire] fixed scala style errors
b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
900b586 [leahmcguire] fixed model call so that uses type argument
ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
e016569 [leahmcguire] updated test suite with model type fix
85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
dc65374 [leahmcguire] integrated model type fix
7622b0c [leahmcguire] added comments and fixed style as per rb
b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
76e5b0f [leahmcguire] removed unnecessary sort from test
0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
This fixes `predictAll` after load. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5243 from mengxr/SPARK-6571 and squashes the following commits:
82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
mengxr Should we add test suite for this transformer? There is no test suite for all feature transformers in ML package now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5181 from yinxusen/SPARK-6526 and squashes the following commits:
6faa7bf [Xusen Yin] fix style
8a462da [Xusen Yin] remove duplications
ab35ab0 [Xusen Yin] add test suite
bc8cd0f [Xusen Yin] fix comment
79774c9 [Xusen Yin] add Normalizer transformer in ML package
There are any bugs of breeze's SparseVector at 0.11.1. You know, Spark 1.3 depends on breeze 0.11.1. So I think we should upgrade it to 0.11.2.
https://issues.apache.org/jira/browse/SPARK-6341
And thanks you for your great cooperation, David Hall(dlwh)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#5222 from yu-iskw/upgrade-breeze and squashes the following commits:
ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
minor thing. Let me know if jira is required.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5207 from hhbyyh/adjustImport and squashes the following commits:
2240121 [Yuhao Yang] remove unused import
Should be self explanatory.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4986 from MechCoder/spark-5987 and squashes the following commits:
7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
setValidateData
LassoWithSGD
setIntercept
setValidateData
RidgeRegressionWithSGD
setIntercept
setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4997 from yanboliang/spark-6256 and squashes the following commits:
102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.
Author: Augustin Borsu <augustin@sagacify.com>
Author: Augustin Borsu <a.borsu@gmail.com>
Author: Augustin Borsu <aborsu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4504 from aborsu985/master and squashes the following commits:
716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
a164800 [Xiangrui Meng] remove tabs
556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
9651aec [Xiangrui Meng] update test
f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
148126f [Augustin Borsu] Added return type to public functions
12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
6a85982 [Augustin Borsu] Style corrections
38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
e262bac [Augustin Borsu] Added unit tests in scala
cd6642e [Augustin Borsu] Changed regex to pattern
132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9f8685a [Augustin Borsu] RegexTokenizer
9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9547e9d [Augustin Borsu] RegEx Tokenizer
01cd26f [Augustin Borsu] RegExTokenizer
In GeneralizedLinearAlgorithm ```numFeatures``` is default to -1, we need to update it to correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train multiclass classification model, it will throw exception due to the numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5167 from yanboliang/spark-6496 and squashes the following commits:
8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5118 from MechCoder/spark-6308 and squashes the following commits:
6c8ffab [MechCoder] Add test for simpleString
b966242 [MechCoder] [SPARK-6308] [MLlib][Sql] VectorUDT is displayed as vecto in dtypes
Author: vinodkc <vinod.kc.in@gmail.com>
Closes#5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits:
2c6aee6 [vinodkc] Spark 1.3 doc fixes
Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4906 from MechCoder/spark-6025 and squashes the following commits:
67146ab [MechCoder] Minor
352001f [MechCoder] Minor
6e8aa10 [MechCoder] Made the following changes Used mapPartition instead of map Refactored computeError and unpersisted broadcast variables
bc99ac6 [MechCoder] Refactor the method and stuff
dbda033 [MechCoder] [SPARK-6025] Add helper method evaluateEachIteration to extract learning curve
Utilities to serialize and deserialize Matrices in MLlib
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5048 from MechCoder/spark-6309 and squashes the following commits:
05dc6f2 [MechCoder] Hashcode and organize imports
16d5d47 [MechCoder] Test some more
6e67020 [MechCoder] TST: Test using Array conversion instead of equals
7fa7a2c [MechCoder] [SPARK-6309] [SQL] [MLlib] Implement MatrixUDT
Add checkpiontInterval to ALS to prevent:
1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some node fail.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5076 from mengxr/SPARK-5955 and squashes the following commits:
df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS
This PR implements two functions
- `topByKey(num: Int): RDD[(K, Array[V])]` finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.
- `takeOrderedByKey(num: Int): RDD[(K, Array[V])] ` does the opposite of `topByKey`
The `sorted` is used here as the `toArray` method of the PriorityQueue does not return a necessarily sorted array.
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes#5075 from coderxiang/topByKey and squashes the following commits:
1611c37 [Shuo Xiang] code clean up
6f565c0 [Shuo Xiang] naming
a80e0ec [Shuo Xiang] typo and warning
82dded9 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
d202745 [Shuo Xiang] move to MLPairRDDFunctions
901b0af [Shuo Xiang] style check
70c6e35 [Shuo Xiang] remove takeOrderedByKey, update doc and test
0895c17 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
b10e325 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
debccad [Shuo Xiang] topByKey
I want to add a checker to turn public type checking on, since future pull requests can accidentally expose a non-public type. This is the first cleanup task.
Author: Reynold Xin <rxin@databricks.com>
Closes#5102 from rxin/mllib-hashcode-publicmethodtypes and squashes the following commits:
617f19e [Reynold Xin] Fixed Scala compilation error.
52bc2d5 [Reynold Xin] [MLlib] Added explicit type for public methods and implemented hashCode when equals is defined.
GLM toString prints out intercept, numFeatures.
For LogisticRegression and SVM model, toString also prints out numClasses, threshold.
GLM toDebugString prints out the whole weights, intercept.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5038 from yanboliang/spark-6291 and squashes the following commits:
2f578b0 [Yanbo Liang] code format
78b33f2 [Yanbo Liang] fix typos
1e8a023 [Yanbo Liang] GLM toString & toDebugString
I find it's better to have getter for NumFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage, otherwise I 'll have to get the value through debug.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5058 from hhbyyh/addGetLinear and squashes the following commits:
9dc90e8 [Yuhao Yang] add get for GeneralizedLinearAlgo
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
LBFGS uses convergence tolerance. This value should be written in document as an argument.
Author: lewuathe <lewuathe@me.com>
Closes#5033 from Lewuathe/SPARK-6336 and squashes the following commits:
e738b33 [lewuathe] Modify text to be more natural
ac03c3a [lewuathe] Modify documentations
6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
Note: not relevant for Python API since it only has a static train method
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4969 from jkbradley/SPARK-6252 and squashes the following commits:
a471d90 [Joseph K. Bradley] small edits from review
63eff48 [Joseph K. Bradley] Added getLambda to Scala NaiveBayes
This continues the work in #4460 from srowen . The design doc is published on the JIRA page with some minor changes.
Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
More details can be found in the design doc.
srowen Could you help review this PR? There are many lines but most of them are boilerplate code.
Author: Xiangrui Meng <meng@databricks.com>
Author: Sean Owen <sowen@cloudera.com>
Closes#4925 from mengxr/SPARK-4588-new and squashes the following commits:
71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
b1aceef [Xiangrui Meng] more tests
e7ab467 [Xiangrui Meng] update ML attribute impl
7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
jira: https://issues.apache.org/jira/browse/SPARK-6268
KMeans has many setters for parameters. It should have matching getters.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#4974 from hhbyyh/get4Kmeans and squashes the following commits:
f44d4dc [Yuhao Yang] add experimental to getRuns
f94a3d7 [Yuhao Yang] add get for KMeans
The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
Author: Xiangrui Meng <meng@databricks.com>
Closes#4699 from mengxr/SPARK-5814 and squashes the following commits:
48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Sean Owen <sowen@cloudera.com>
Closes#4950 from srowen/SPARK-6225 and squashes the following commits:
3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#4951 from yinxusen/SPARK-5986 and squashes the following commits:
6dd74a0 [Xusen Yin] rewrite some functions and classes
cd390fd [Xusen Yin] add indexed point
b144216 [Xusen Yin] remove invalid comments
dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that returns `RDD[(Double, Double)]`. Is it easy to register serializer for `Product` in PySpark?
Author: Xiangrui Meng <meng@databricks.com>
Closes#4863 from mengxr/SPARK-6090 and squashes the following commits:
009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Author: Sean Owen <sowen@cloudera.com>
Closes#4912 from srowen/SPARK-6182.1 and squashes the following commits:
eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
LBFGS and OWLQN in Breeze 0.10 has convergence check bug.
This is fixed in 0.11, see the description in Breeze project for detail:
https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>
Author: DB Tsai <dbtsai@dbtsai.com>
Closes#4879 from dbtsai/breeze and squashes the following commits:
d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
35c2f26 [Xiangrui Meng] fix LRSuite
397a208 [DB Tsai] upgrade breeze
Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell.
This prints a warning.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4864 from jkbradley/dt-save-heap and squashes the following commits:
02e8daf [Joseph K. Bradley] fixed based on code review
7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5950
Thanks viirya for the initial work in #4729.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4826 from yhuai/insertNullabilityCheck and squashes the following commits:
3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4811 from mengxr/SPARK-5991 and squashes the following commits:
f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
Remove unicode characters from MLlib file.
Author: Michael Griffiths <msjgriffiths@gmail.com>
Author: Griffiths, Michael (NYC-RPM) <michael.griffiths@reprisemedia.com>
Closes#4815 from msjgriffiths/SPARK-6063 and squashes the following commits:
bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
Since the validation error does not change monotonically, in practice, it should be proper to pick the best model when training GradientBoostedTrees with validation instead of stopping it early.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4763 from viirya/gbt_record_model and squashes the following commits:
452e049 [Liang-Chi Hsieh] Address comment.
ea2fae2 [Liang-Chi Hsieh] Pick the best model when training GradientBoostedTrees with validation.
The model trained by ALS requires partitioning information to do quick lookup of a user/item factor for making recommendation on individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause performance regression.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#4748 from mengxr/SPARK-5976 and squashes the following commits:
9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS
260f183 [Xiangrui Meng] add a test for partitioner
One can early stop if the decrease in error rate is lesser than a certain tol or if the error increases if the training data is overfit.
This introduces a new method runWithValidation which takes in a pair of RDD's , one for the training data and the other for the validation.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4677 from MechCoder/spark-5436 and squashes the following commits:
1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`. It doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.
CC: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4695 from mengxr/SPARK-5900 and squashes the following commits:
865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
Another one from JoshRosen 's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all the change is standardizing class and method names.
Author: Sean Owen <sowen@cloudera.com>
Closes#4514 from srowen/SPARK-4682 and squashes the following commits:
5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema be public instead of private to ml. This would be nice to include in Spark 1.3
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4682 from jkbradley/SPARK-5902 and squashes the following commits:
6f02357 [Joseph K. Bradley] Made transformSchema public
0e6d0a0 [Joseph K. Bradley] made implementations of transformSchema protected as well
fdaf26a [Joseph K. Bradley] Made PipelineStage.transformSchema protected instead of private[ml]
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
For unordered features, it is sufficient to use splits since the threshold of the split corresponds the threshold of the HighSplit of the bin and there is no use of the LowSplit.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4231 from MechCoder/spark-3381 and squashes the following commits:
58c19a5 [MechCoder] COSMIT
c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint
b2b9b89 [MechCoder] COSMIT
d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
`numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes performance regression, especially in Python.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4647 from mengxr/SPARK-5858 and squashes the following commits:
036dc7f [Xiangrui Meng] remove unnecessary first() call
12c5548 [Xiangrui Meng] check numFeatures only once
If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#4593 from mengxr/SPARK-5802 and squashes the following commits:
ae3be84 [Xiangrui Meng] cache transformed data in glm
On a big dataset explicitly unpersist train and validation folds allows to load more data into memory in the next loop iteration. On my environment (single node 8Gb worker RAM, 2 GB dataset file, 3 folds for cross validation), saved more than 5 minutes.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes#4595 from petro-rudenko/patch-2 and squashes the following commits:
66a7cfb [Peter Rudenko] Move validationDataset cache to declaration
c5f3265 [Peter Rudenko] [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop
If it's a last estimator in Pipeline there's no need to transform data, since there's no next stage that would consume this data.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes#4590 from petro-rudenko/patch-1 and squashes the following commits:
d13ec33 [Peter Rudenko] [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline
This PR adds three groups to the ScalaDoc: `param`, `setParam`, and `getParam`. Params will show up in the generated Scala API doc as the top group. Setters/getters will be at the bottom.
Preview:
![screen shot 2015-02-13 at 2 47 49 pm](https://cloud.githubusercontent.com/assets/829644/6196657/5740c240-b38f-11e4-94bb-bd8ef5a796c5.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes#4600 from mengxr/SPARK-5730 and squashes the following commits:
febed9a [Xiangrui Meng] add doc groups to spark.ml components
because ArrayBuffer is not specialized.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4594 from mengxr/SPARK-5803 and squashes the following commits:
1261bd5 [Xiangrui Meng] merge master
a4ea872 [Xiangrui Meng] use ArrayBuilder to build primitive arrays
This PR detaches MLlib model import/export code from SQL's JSON support, and hence unblocks #4544 . yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes#4555 from mengxr/SPARK-5757 and squashes the following commits:
b0415e8 [Xiangrui Meng] replace SQL JSON usage by json4s
The `initialState` of LDA performs several RDD operations that looks redundant. This pr tries to simplify these operations.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4501 from viirya/sim_lda and squashes the following commits:
4870fe4 [Liang-Chi Hsieh] For comments.
9af1487 [Liang-Chi Hsieh] Refactor initial step of LDA to remove redundant operations.
Do not recursively strip out projects. Only strip the first level project.
```scala
df("colA") + df("colB").as("colC")
```
Previously, the above would construct an invalid plan.
Author: Reynold Xin <rxin@databricks.com>
Closes#4519 from rxin/computability and squashes the following commits:
87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.
Author: Davies Liu <davies@databricks.com>
Closes#4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
Following discussion in the Jira.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4459 from MechCoder/sparse_gmm and squashes the following commits:
1b18dab [MechCoder] Rewrite syr for sparse matrices
e579041 [MechCoder] Add test for covariance matrix
5cb370b [MechCoder] Separate tests for sparse data
5e096bd [MechCoder] Alphabetize and correct error message
e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input
This is based on #4444 from jkbradley with the following changes:
1. Node schema updated to
~~~
treeId: int
nodeId: Int
predict/
|- predict: Double
|- prob: Double
impurity: Double
isLeaf: Boolean
split/
|- feature: Int
|- threshold: Double
|- featureType: Int
|- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
~~~
2. Some refactor of the implementation.
Closes#4444.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4493 from mengxr/SPARK-5597 and squashes the following commits:
75e3bb6 [Xiangrui Meng] fix style
2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
45873a2 [Joseph K. Bradley] org imports
1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles
Fix ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, not sure if that's to be treated as an MLlib or Breeze issue.
Author: Sean Owen <sowen@cloudera.com>
Closes#4485 from srowen/SPARK-4900 and squashes the following commits:
7355aa1 [Sean Owen] Fix ARPACK error code mapping
Author: Sandy Ryza <sandy@cloudera.com>
Closes#1093 from sryza/sandy-spark-2149 and squashes the following commits:
5f06b33 [Sandy Ryza] More review comments
0f73060 [Sandy Ryza] Respond to Sean's review comments
0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation
Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.
Author: Sean Owen <sowen@cloudera.com>
Closes#4461 from srowen/SPARK-4405 and squashes the following commits:
c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods
This is #4447 with `override`.
Closes#4447
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4462 from mengxr/SPARK-5660 and squashes the following commits:
f82c8d6 [Xiangrui Meng] add override to matrix.apply
91cedde [Joseph K. Bradley] made matrix apply public
following #4233. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4422 from mengxr/SPARK-5598 and squashes the following commits:
a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM
...ceed max int.
Large values of k and/or n in EigenValueDecomposition.symmetricEigs will result in array initialization to a value larger than Integer.MAX_VALUE in the following: var v = new Array[Double](n * ncv)
Author: mbittmann <mbittmann@gmail.com>
Author: bittmannm <mark.bittmann@agilex.com>
Closes#4433 from mbittmann/master and squashes the following commits:
ee56e05 [mbittmann] [SPARK-5656] Combine checks into simple message
e49cbbb [mbittmann] [SPARK-5656] Simply error message
860836b [mbittmann] Array size check updates based on code review
a604816 [bittmannm] [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int.
Overload `trainOn`, `predictOn`, and `predictOnValues`.
CC freeman-lab
Author: Xiangrui Meng <meng@databricks.com>
Closes#4432 from mengxr/streaming-java and squashes the following commits:
6a79b85 [Xiangrui Meng] add java test for streaming logistic regression
2d7b357 [Xiangrui Meng] organize imports
1f662b3 [Xiangrui Meng] make streaming linear algorithms Java-friendly
`LogisticRegressionModel`'s `predictPoint` should directly use broadcasted weights. This pr also fixes the compilation errors of two unit test suite: `JavaLogisticRegressionSuite ` and `JavaLinearRegressionSuite`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4429 from viirya/use_bcvalue and squashes the following commits:
5a797e5 [Liang-Chi Hsieh] Use broadcasted weights. Fix compilation error.
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
* The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.
### Sketch of APIs (most of which are private[spark] for now)
Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
* There is no strongly typed training-time API.
* There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
* Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)
Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
* This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
* This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
* Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
* Update validateAndTransformSchema
* Update transform
* (Updated examples, test suites according to other changes)
New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
* Added Java version too
Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites
CC: mengxr etrain shivaram
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3637 from jkbradley/ml-api-part1 and squashes the following commits:
405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT. * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier. * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
This is the second part of SPARK-5604, which removes checkpointDir from tree strategies. Note that this is a break change. I will mention it in the migration guide.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4407 from mengxr/SPARK-5604-1 and squashes the following commits:
13a276d [Xiangrui Meng] remove checkpointDir from trees
`checkpointDir` is a Spark global configuration. Users should set it outside LDA. This PR also hides some methods under `private[clustering] object LDA`, so they don't show up in the generated Java doc (SPARK-5610).
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4390 from mengxr/SPARK-5604 and squashes the following commits:
a34bb39 [Xiangrui Meng] remove checkpointDir from LDA
Because `deleteAllCheckpoints` has IOException potential.
fix issue.
Author: x1- <viva008@gmail.com>
Closes#4347 from x1-/SPARK-5460 and squashes the following commits:
7a3d8de [x1-] change `Try()` to `try catch { case ... }` ar RandomForest.
3a52745 [x1-] modified typo. 'faild' -> 'failed' and remove disused '-'.
1572576 [x1-] Wrapped `Try` around `deleteAllCheckpoints` - RandomForest.
There are no break changes (against 1.2) in this PR. I hide the PythonMLLibAPI, which is only called by Py4J, and renamed `SparseMatrix.diag` to `SparseMatrix.spdiag`. All other changes are documentation and annotations. The `Experimental` tag is removed from `ALS.setAlpha` and `Rating`. One issue not addressed in this PR is the `setCheckpointDir` in `LDA` (https://issues.apache.org/jira/browse/SPARK-5604).
CC: srowen jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4377 from mengxr/SPARK-5599 and squashes the following commits:
17975dc [Xiangrui Meng] fix tests
4487f20 [Xiangrui Meng] remove experimental tag from each stat method because Statistics is experimental already
3cd969a [Xiangrui Meng] remove freeman (sorry~) from StreamLA public doc
55900f5 [Xiangrui Meng] make IR experimental and update its doc
9b8eed3 [Xiangrui Meng] graduate Rating and setAlpha in ALS
b854d28 [Xiangrui Meng] correct iid doc in RandomRDDs
27f5bdd [Xiangrui Meng] update linalg docs and some new method signatures
371721b [Xiangrui Meng] mark fpg as experimental and update its doc
8aca7ee [Xiangrui Meng] change SLR to experimental and update the doc
ebbb2e9 [Xiangrui Meng] mark PIC experimental and update the doc
7830d3b [Xiangrui Meng] mark GMM experimental
a378496 [Xiangrui Meng] use the correct subscript syntax in PIC
c65c424 [Xiangrui Meng] update LDAModel doc
a213b0c [Xiangrui Meng] update GMM constructor
3993054 [Xiangrui Meng] hide algorithm in SLR
ad6b9ce [Xiangrui Meng] Revert "make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD"
0054684 [Xiangrui Meng] add doc to LRModel's constructor
a89763b [Xiangrui Meng] make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD
7c0946c [Xiangrui Meng] hide PythonMLLibAPI
This is a PR for Parquet-based model import/export. Please see the design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).
Note: This includes only a subset of regression and classification models:
* NaiveBayes, SVM, LogisticRegression
* LinearRegression, RidgeRegression, Lasso
Follow-up PRs will cover other models.
Sketch of current contents:
* New traits: Saveable, Loader
* Implementations for some algorithms
* Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)
CC: mengxr selvinsource
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4233 from jkbradley/ml-import-export and squashes the following commits:
87c4eb8 [Joseph K. Bradley] small cleanups
12d9059 [Joseph K. Bradley] Many cleanups after code review. Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
b4ee064 [Joseph K. Bradley] Reorganized save/load for regression and classification. Renamed concepts to Saveable, Loader
a34aef5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into ml-import-export
ee99228 [Joseph K. Bradley] scala style fix
79675d5 [Joseph K. Bradley] cleanups in LogisticRegression after rebasing after multinomial PR
d1e5882 [Joseph K. Bradley] organized imports
2935963 [Joseph K. Bradley] Added save/load and tests for most classification and regression models
c495dba [Joseph K. Bradley] made version for model import/export local to each model
1496852 [Joseph K. Bradley] Added save/load for NaiveBayes
8d46386 [Joseph K. Bradley] Added save/load to NaiveBayes
1577d70 [Joseph K. Bradley] fixed issues after rebasing on master (DataFrame patch)
64914a3 [Joseph K. Bradley] added getThreshold to SVMModel
b1fc5ec [Joseph K. Bradley] small cleanups
418ba1b [Joseph K. Bradley] Added save, load to mllib.classification.LogisticRegressionModel, plus test suite
If `seed` is `None` on the python side, it will pass in as a `null`. So we should use `java.lang.Long` instead of `Long` to take it.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4349 from mengxr/gmm-fix and squashes the following commits:
3be5926 [Xiangrui Meng] fix seed handling in Python GMM
Make FPGrowth.run API take generic item types:
`def run[Item: ClassTag, Basket <: Iterable[Item]](data: RDD[Basket]): FPGrowthModel[Item]`
so that user can invoke it by run[String, Seq[String]], run[Int, Seq[Int]], run[Int, List[Int]], etc.
Scala part is done, while java part is still in progress
Author: Jacky Li <jacky.likun@huawei.com>
Author: Jacky Li <jackylk@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4340 from jackylk/SPARK-5520-WIP and squashes the following commits:
f5acf84 [Jacky Li] Merge pull request #2 from mengxr/SPARK-5520
63073d0 [Xiangrui Meng] update to make generic FPGrowth Java-friendly
737d8bb [Jacky Li] fix scalastyle
793f85c [Jacky Li] add Java test case
7783351 [Jacky Li] add generic support in FPGrowth
Author: Xiangrui Meng <meng@databricks.com>
Closes#4329 from mengxr/streaming-lr and squashes the following commits:
78731e1 [Xiangrui Meng] update streaming linear algorithms
**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**
The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it. In particular, see the API and Planning for the Future sections.
* Settle on a public API which may eventually include:
* more inference algorithms
* more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there. However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently. It will be important to check and improve accuracy. For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.
**Dependency: This makes MLlib depend on GraphX.**
Files and classes:
* LDA.scala (441 lines):
* class LDA (main estimator class)
* LDA.Document (text + document ID)
* LDAModel.scala (266 lines)
* abstract class LDAModel
* class LocalLDAModel
* class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)
Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala
Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
Here, I list the main changes AFTER the design doc was posted.
Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters. This is different from the likelihood of the data given the hyperparameters, which would be harder to compute. I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors. I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR. I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms. See Asuncion et al. (2009) in the design doc for a good demonstration. I encourage good behavior via defaults and warning messages.
Items planned for future PRs:
* perplexity
* API taking Strings
* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
* Pro: We may someday want LinearDiscriminantAnalysis.
* Con: Very long names
* Should LDA reside in clustering? Or do we want a sub-package?
* mllib.topicmodel
* mllib.clustering.topicmodel
* Does the API seem reasonable and extensible?
* Unit tests:
* Should there be a test which checks a clustering results? E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters. Does that sound useful or too flaky?
This has not been tested much for scaling. I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for 500 iterations made it fail because of GC problems. I'm running larger scale tests & will put results here, but future PRs may need to improve the scaling.
* dlwh for the initial implementation
* + jegonzal for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal IlyaKozlov
* Note: The plan is to include this full list in the authors if this PR gets merged. Please notify me if you prefer otherwise.
CC: mengxr
Authors:
Joseph K. Bradley <joseph@databricks.com>
Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
David Hall <david.lw.hall@gmail.com>
Guoqiang Li <witgo@qq.com>
Xiangrui Meng <meng@databricks.com>
Pedro Rodriguez <pedro@snowgeek.org>
Avanesov Valeriy <acopich@gmail.com>
Xusen Yin <yinxusen@gmail.com>
Closes#2388Closes#4047 from jkbradley/davidhall-lda and squashes the following commits:
77e8814 [Joseph K. Bradley] small doc fix
5c74345 [Joseph K. Bradley] cleaned up doc based on code review
589728b [Joseph K. Bradley] Updates per code review. Main change was in LDAExample for faster vocab computation. Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
e391474 [Joseph K. Bradley] Removed LDATiming. Added PeriodicGraphCheckpointerSuite.scala. Small LDA cleanups.
e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception. Improved preprocessing to reduce passes over data
1a231b4 [Joseph K. Bradley] fixed scalastyle
91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample. Will remove LDATiming after done testing
993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration * Also added aliases alpha, beta
cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
43c1c40 [Joseph K. Bradley] small cleanup
0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods. Changed word to term in some places. Updated LDAExample to use default smoothing amounts.
fb1e7b5 [Xiangrui Meng] minor
08d59a3 [Xiangrui Meng] reset spacing
9fe0b95 [Xiangrui Meng] optimize aggregateMessages
cec0a9c [Xiangrui Meng] * -> *=
6cb11b0 [Xiangrui Meng] optimize computePTopic
9eb3d02 [Xiangrui Meng] + -> +=
892530c [Xiangrui Meng] use axpy
45cc7f2 [Xiangrui Meng] mapPart -> flatMap
ce53be9 [Joseph K. Bradley] fixed example name
75749e7 [Joseph K. Bradley] scala style fix
9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
377ebd9 [Joseph K. Bradley] separated LDA models into own file. more cleanups before PR
2d40006 [Joseph K. Bradley] cleanups before PR
2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.
CC: srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#4321 from mengxr/SPARK-5536 and squashes the following commits:
5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
This adds support for streaming logistic regression with stochastic gradient descent, in the same manner as the existing implementation of streaming linear regression. It is a relatively simple addition because most of the work is already done by the abstract class `StreamingLinearAlgorithm` and existing algorithms and models from MLlib.
The PR includes
- Streaming Logistic Regression algorithm
- Unit tests for accuracy, streaming convergence, and streaming prediction
- An example use
cc mengxr tdas
Author: freeman <the.freeman.lab@gmail.com>
Closes#4306 from freeman-lab/streaming-logisitic-regression and squashes the following commits:
5c2c70b [freeman] Use Option on model
5cca2bc [freeman] Merge remote-tracking branch 'upstream/master' into streaming-logisitic-regression
275f8bd [freeman] Make private to mllib
3926e4e [freeman] Line formatting
5ee8694 [freeman] Experimental tag for docs
2fc68ac [freeman] Fix example formatting
85320b1 [freeman] Fixed line length
d88f717 [freeman] Remove stray comment
59d7ecb [freeman] Add streaming logistic regression
e78fe28 [freeman] Add streaming logistic regression example
321cc66 [freeman] Set private and protected within mllib