Author: Kashif Rasul <kashif.rasul@gmail.com>
Closes #7269 from kashif/SPARK-8872 and squashes the following commits:
2d5457f [Kashif Rasul] added R code for FP Int type
3de6808 [Kashif Rasul] added verification results from R for FPGrowthSuite
In LinearRegression and LogisticRegression, we use Breeze's optimizers (LBFGS and OWLQN). We check the State.value to see the current objective. However, Breeze's documentation makes it sound like value and adjustedValue differ for some optimizers, possibly including OWLQN: 26faf62286/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala (L36)
If that is the case, then we should use adjustedValue instead of value. This is relevant to SPARK-8538 and SPARK-8539, where we will provide the objective trace to the user.
Author: DB Tsai <dbt@netflix.com>
Closes #7245 from dbtsai/SPARK-8845 and squashes the following commits:
fa4c91e [DB Tsai] address feedback
e6caac1 [DB Tsai] java style multiline comment
b10c574 [DB Tsai] address feedback
c9ff81e [DB Tsai] first commit
Add std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:
9fbae90 [MechCoder] Add type
6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
Distributed generation of single-consequent association rules from an RDD of frequent itemsets. Tests are checked against `R`'s implementation of Apriori in [arules](http://cran.r-project.org/web/packages/arules/index.html).
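As a rough illustration of single-consequent rule generation, the following Python sketch derives rules from a dictionary of frequent itemsets and their counts. It is a local, non-distributed sketch: the function name `generate_rules`, the `min_confidence` parameter, and the input layout are assumptions made for illustration, not the API added by this PR.

```python
def generate_rules(freq_itemsets, min_confidence=0.8):
    """Return (antecedent, consequent, confidence) for single-consequent rules.

    freq_itemsets maps a frozenset of items to its occurrence count.
    """
    rules = []
    for itemset, count in freq_itemsets.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one antecedent item
        for consequent in itemset:
            antecedent = itemset - {consequent}
            antecedent_count = freq_itemsets.get(antecedent)
            if antecedent_count is None:
                continue  # antecedent is not a known frequent itemset
            confidence = count / antecedent_count
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical counts: "a" appears 4 times, "b" 3 times, both together 3 times.
freq = {
    frozenset(["a"]): 4,
    frozenset(["b"]): 3,
    frozenset(["a", "b"]): 3,
}
rules = generate_rules(freq, min_confidence=0.8)
```

Here only the rule {b} => a survives, because its confidence is 3/3, while {a} => b has confidence 3/4 and falls below the threshold.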
Author: Feynman Liang <fliang@databricks.com>
Closes #7005 from feynmanliang/fp-association-rules-distributed and squashes the following commits:
466ced0 [Feynman Liang] Refactor AR generation impl
73c1cff [Feynman Liang] Make rule attributes public, remove numTransactions from FreqItemset
80f63ff [Feynman Liang] Change default confidence and optimize imports
04cf5b5 [Feynman Liang] Code review with @mengxr, add R to tests
0cc1a6a [Feynman Liang] Java compatibility test
f3c14b5 [Feynman Liang] Fix MiMa test
764375e [Feynman Liang] Fix tests
1187307 [Feynman Liang] Almost working tests
b20779b [Feynman Liang] Working implementation
5395c4e [Feynman Liang] Fix imports
2d34405 [Feynman Liang] Partial implementation of distributed ar
83ace4b [Feynman Liang] Local rule generation without pruning complete
69c2c87 [Feynman Liang] Working local implementation, now to parallelize../..
4e1ec9a [Feynman Liang] Pull FreqItemsets out, refactor type param, tests
69ccedc [Feynman Liang] First implementation of association rule generation
Add numNodes and depth to treeModels, add treeWeights to ensemble Models.
Add __repr__ to all models.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #7095 from MechCoder/missing_methods_tree and squashes the following commits:
23b08be [MechCoder] private [spark]
38a0860 [MechCoder] rename pyTreeWeights to javaTreeWeights
6d16ad8 [MechCoder] Fix Python 3 Error
47d7023 [MechCoder] Use np.allclose and treeEnsembleModel -> TreeEnsembleMethods
819098c [MechCoder] [SPARK-8711] [ML] Add additional methods ot PySpark ML tree models
Add Java unit test for PCA transformer
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7184 from yanboliang/spark-8788 and squashes the following commits:
9d1a2af [Yanbo Liang] address comments
b34451f [Yanbo Liang] Add Java unit test for PCA transformer
See the JIRA: https://issues.apache.org/jira/browse/SPARK-5562
Author: Alok Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok Singh <singhal@aloks-mbp.usca.ibm.com>
Author: Alok Singh <singhal@us.ibm.com>
Closes #7064 from aloknsingh/aloknsingh_SPARK-5562 and squashes the following commits:
259a0a7 [Alok Singh] change as per the comments by @jkbradley
be48491 [Alok Singh] [SPARK-5562][MLlib] re-order import in alphabhetical order
c01311b [Alok Singh] [SPARK-5562][MLlib] fix the newline typo
b271c8a [Alok Singh] [SPARK-5562][Mllib] As per github discussion with jkbradley. We would like to simply things.
7c06251 [Alok Singh] [SPARK-5562][MLlib] modified the JavaLDASuite for test passing
c710cb6 [Alok Singh] fix the scala code style to have space after :
2572a08 [Alok Singh] [SPARK-5562][MLlib] change the import xyz._ to the import xyz.{c1, c2} ..
ab55fbf [Alok Singh] [SPARK-5562][MLlib] Change as per Sean Owen's comments https://github.com/apache/spark/pull/7064/files#diff-9236d23975e6f5a5608ffc81dfd79146
9f4f9ea [Alok Singh] [SPARK-5562][MLlib] LDA should handle empty document.
This reverts commit 25f574eb9a. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits:
2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>
Closes #5992 from rekhajoshm/fix/SPARK-7137 and squashes the following commits:
8c42b57 [Joshi] update checkInputColumn to print more info if needed
33ddd2e [Joshi] update checkInputColumn to print more info if needed
acf3e17 [Joshi] update checkInputColumn to print more info if needed
8993c0e [Joshi] SPARK-7137: Add checkInputColumn back to Params and print more info
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:
975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
GradientDescent can now take a convergence tolerance value. The default value is 0.0.
Iteration terminates once the loss value falls below the tolerance set by the user.
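A minimal sketch of gradient descent with a convergence tolerance follows. It assumes the check compares the change in the solution vector relative to its norm, which is how the squashed commit titles describe the final logic; the description itself phrases the check in terms of the loss. The function and parameter names are hypothetical and this is plain Python, not the MLlib implementation.

```python
import math


def gradient_descent(grad, w0, step_size=0.1, num_iterations=100,
                     convergence_tol=0.0):
    """Run gradient descent; return (weights, iterations actually run)."""
    w = list(w0)
    for i in range(1, num_iterations + 1):
        g = grad(w)
        new_w = [wi - step_size * gi for wi, gi in zip(w, g)]
        # Size of the update and size of the new solution (Euclidean norms).
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_w, w)))
        norm = math.sqrt(sum(a * a for a in new_w))
        w = new_w
        # With the default tolerance of 0.0 this test is never true,
        # so all num_iterations steps are executed.
        if diff < convergence_tol * max(norm, 1.0):
            return w, i
    return w, num_iterations


def grad(w):
    # Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3.
    return [2.0 * (w[0] - 3.0)]


w_default, iters_default = gradient_descent(grad, [0.0])
w_tol, iters_tol = gradient_descent(grad, [0.0], convergence_tol=1e-3)
```

With a positive tolerance the loop stops well before the iteration limit, at the cost of a slightly less precise solution.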
Author: lewuathe <lewuathe@me.com>
Closes #3636 from Lewuathe/gd-convergence-tolerance and squashes the following commits:
0b8a9a8 [lewuathe] Update doc
ce91b15 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
4f22c2b [lewuathe] Modify based on SPARK-1503
5e47b82 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
abadb7e [lewuathe] Fix LassoSuite
8fadebd [lewuathe] Fix failed unit tests
ee5de46 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
8313ba2 [lewuathe] Fix styles
0ead94c [lewuathe] Merge branch 'master' into gd-convergence-tolerance
a94cfd5 [lewuathe] Modify some styles
3aef0a2 [lewuathe] Modify converged logic to do relative comparison
f7b19d5 [lewuathe] [SPARK-3382] Clarify comparison logic
e6c9cd2 [lewuathe] [SPARK-3382] Compare with the diff of solution vector
4b125d2 [lewuathe] [SPARK3382] Fix scala style
e7c10dd [lewuathe] [SPARK-3382] format improvements
f867eea [lewuathe] [SPARK-3382] Modify warning message statements
b9d5e61 [lewuathe] [SPARK-3382] should compare diff inside loss history and convergence tolerance
5433f71 [lewuathe] [SPARK-3382] GradientDescent convergence tolerance
Matrices allow explicit zeros to be stored in values, so the number of active (stored) values can differ from the number of nonzeros. A method to check whether the two counts match is sometimes handy.
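The distinction can be sketched with a small compressed-sparse-column container in Python. The class and method names below are hypothetical stand-ins chosen for illustration; they only mirror the idea of counting stored entries versus counting entries that are actually nonzero.

```python
class CscMatrixSketch:
    """Tiny CSC-style container that may hold explicitly stored zeros."""

    def __init__(self, num_rows, num_cols, col_ptrs, row_indices, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.col_ptrs = col_ptrs
        self.row_indices = row_indices
        self.values = values

    def num_actives(self):
        # Every stored entry counts, including explicit zeros.
        return len(self.values)

    def num_nonzeros(self):
        # Only stored entries whose value is not zero.
        return sum(1 for v in self.values if v != 0.0)


# 2 x 2 matrix with three stored entries, one of which is an explicit zero.
m = CscMatrixSketch(2, 2, [0, 1, 3], [0, 0, 1], [1.0, 0.0, 2.0])
has_explicit_zeros = m.num_actives() != m.num_nonzeros()
```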
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6904 from MechCoder/nnz_matrix and squashes the following commits:
252c6b7 [MechCoder] Add to MiMa excludes
e2390f5 [MechCoder] Use count instead of foreach
2f62b2f [MechCoder] Add to MiMa excludes
d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
JIRA: https://issues.apache.org/jira/browse/SPARK-8708
Previously, the ratings were partitioned based only on the given products. If the `usersProducts` passed for prediction contains only a few products, or even a single one, the generated ratings are pushed into a few partitions or a single one, and the job cannot exploit high parallelism.
The following code is the example reported in the JIRA. Because it asks for the predictions of all users on product 2, only one partition of the result is non-empty.
>>> from operator import itemgetter
>>> from pyspark.mllib.recommendation import ALS
>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
>>> r4 = (2, 2, 2.0)
>>> r5 = (3, 1, 1.0)
>>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
>>> users = ratings.map(itemgetter(0)).distinct()
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
>>> predictions_for_2.glom().map(len).collect()
[0, 0, 3, 0, 0]
This PR uses user and product instead of only product to partition the ratings.
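The effect of the partitioning key can be sketched without Spark. The helper below is hypothetical and uses a simple modulo scheme purely for illustration; it is not the partitioner used by ALS or by this PR.

```python
def partition_sizes(pairs, num_partitions, key):
    """Count how many (user, product) pairs land in each partition."""
    sizes = [0] * num_partitions
    for pair in pairs:
        sizes[key(pair) % num_partitions] += 1
    return sizes


# Same request as in the JIRA example: users 1, 2, 3 on product 2.
pairs = [(user, 2) for user in (1, 2, 3)]

# Keying on the product alone sends every pair to the same partition.
by_product = partition_sizes(pairs, 5, key=lambda p: p[1])

# Keying on user and product together spreads the pairs out.
by_user_and_product = partition_sizes(pairs, 5, key=lambda p: 31 * p[0] + p[1])
```

Keying on the product alone reproduces the skewed `[0, 0, 3, 0, 0]` layout from the example, while the combined key places the three pairs in three different partitions.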
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #7121 from viirya/mfm_fix_partition and squashes the following commits:
779946d [Liang-Chi Hsieh] Calculate approximate numbers of users and products in one pass.
4336dc2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into mfm_fix_partition
83e56c1 [Liang-Chi Hsieh] Instead of additional join, use the numbers of users and products to decide how to perform join.
b534dc8 [Liang-Chi Hsieh] Paritition ratings based on both users and products.
I added the following code:
// see [SPARK-8647], this achieves the needed constant hash code without constant no.
override def hashCode(): Int = this.getClass.getName.hashCode()
This provides the constant hash code requested in the JIRA.
Author: Alok Singh <singhal@Aloks-MacBook-Pro.local>
Closes #7146 from aloknsingh/aloknsingh_SPARK-8647 and squashes the following commits:
e58bccf [Alok Singh] [SPARK-8647][MLlib] to avoid the class derivation issues, change the constant hashCode to override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
43cdb89 [Alok Singh] [SPARK-8647][MLlib] Potential issue with constant hashCode
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.
Please let me know if I've missed anything.
Will the spark-shell use the value within the command line builder during instantiation?
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:
4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
Removed '>' symbols from comments in LogisticRegressionSuite.scala, for ease of copy-paste.
Also collapsed the multiline commands onto single lines (is this desirable, or does it violate style?)
Author: Rosstin <asterazul@gmail.com>
Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits:
f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala
fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste
39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661
5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
Rename DiscreteCosineTransformer and related classes to DCT.
Author: Feynman Liang <fliang@databricks.com>
Closes #7138 from feynmanliang/dct-features and squashes the following commits:
e547b3e [Feynman Liang] Fix renaming bug
9d5c9e4 [Feynman Liang] Lowercase JavaDCTSuite variable
f9a8958 [Feynman Liang] Remove old files
f8fe794 [Feynman Liang] Merge branch 'master' into dct-features
894d0b2 [Feynman Liang] Rename DiscreteCosineTransformer to DCT
433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
Sorry, I closed https://github.com/apache/spark/pull/6949 by mistake.
I have pushed the code again and added a test.
There is a bug in `IndexedRowMatrix.computeSVD()`: the returned `U` has `U.numCols() = self.nCols`.
It should have `U.numCols() = k = svd.U.numCols()`.
```
self = U * sigma * V.transpose
(m x n) = (m x n) * (k x k) * (k x n) //ASIS
-->
(m x n) = (m x k) * (k x k) * (k x n) //TOBE
```
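The dimension mismatch can be checked mechanically. The helper below is a hypothetical Python sketch that only tracks matrix shapes through a chain of products; it does not compute an SVD.

```python
def product_shape(shapes):
    """Return the shape of a chained matrix product, or raise on a mismatch."""
    rows, cols = shapes[0]
    for next_rows, next_cols in shapes[1:]:
        if cols != next_rows:
            raise ValueError(
                f"cannot multiply ({rows} x {cols}) by ({next_rows} x {next_cols})"
            )
        cols = next_cols
    return rows, cols


m, n, k = 6, 4, 2

# Corrected shapes: U is (m x k), sigma is (k x k), V.transpose is (k x n).
to_be = product_shape([(m, k), (k, k), (k, n)])

# Buggy shapes: U is (m x n), which cannot be multiplied by a (k x k) sigma.
try:
    product_shape([(m, n), (k, k), (k, n)])
    as_is_valid = True
except ValueError:
    as_is_valid = False
```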
Author: lee19 <lee19@live.co.kr>
Closes #6953 from lee19/MLlibBugfix and squashes the following commits:
c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden.
4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error.
c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib]
8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
Changed GBTRegressor so it does NOT threshold the prediction. Added test which fails with bug but works after fix.
CC: feynmanliang mengxr
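The difference between a thresholded and an unthresholded prediction can be sketched as follows. This assumes the bug was that the regressor applied the same 0/1 thresholding that a binary classifier would use; the description only states that the prediction was thresholded. All names are hypothetical and the trees are stand-in functions.

```python
def raw_ensemble_output(trees, tree_weights, features):
    """Weighted sum of the individual tree predictions."""
    return sum(w * tree(features) for tree, w in zip(trees, tree_weights))


def predict_regression(trees, tree_weights, features):
    # Regression returns the raw weighted sum unchanged.
    return raw_ensemble_output(trees, tree_weights, features)


def predict_classification(trees, tree_weights, features):
    # Binary classification maps the raw sum to a 0/1 label.
    raw = raw_ensemble_output(trees, tree_weights, features)
    return 1.0 if raw > 0.0 else 0.0


# Two stand-in "trees" expressed as plain functions of the feature vector.
trees = [lambda x: 2.0 * x[0], lambda x: x[0] - 1.0]
weights = [1.0, 0.5]

regression_output = predict_regression(trees, weights, [3.0])
classification_output = predict_classification(trees, weights, [3.0])
```

A regressor that thresholds would return 1.0 here instead of the expected 7.0, which is the kind of failure a regression test on continuous labels would catch.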
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7134 from jkbradley/gbrt-fix and squashes the following commits:
613b90e [Joseph K. Bradley] Changed GBTRegressor so it does NOT threshold the prediction
jira: https://issues.apache.org/jira/browse/SPARK-7514
Add a popular scaling method to the feature component, commonly known as min-max normalization or rescaling.
The core function is:
Normalized(x) = (x - min) / (max - min) * scale + newBase
where `newBase` and `scale` are parameters (type Double) of the `VectorTransformer`. `newBase` is the new minimum value for the features, and `scale` controls the width of the range after transformation. This is a little more complicated than basic min-max normalization, but it lets users control the output range more precisely, e.g. [0.1, 0.9] in some NN applications.
When `max == min`, 0.5 is used as the raw value, giving 0.5 * scale + newBase.
I'll add unit tests once the design is settled (and provided this is not considered too naive).
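The formula above can be sketched for a single feature column in plain Python. The function name and the list-based input are assumptions for illustration; `scale` and `new_base` correspond to the `scale` and `newBase` parameters described above.

```python
def min_max_rescale(values, scale=1.0, new_base=0.0):
    """Rescale values to the range [new_base, new_base + scale]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: use 0.5 as the raw normalized value.
        return [0.5 * scale + new_base for _ in values]
    return [(x - lo) / (hi - lo) * scale + new_base for x in values]


# Map a column onto [0.1, 0.9], as in the NN example.
rescaled = min_max_rescale([1.0, 2.0, 3.0], scale=0.8, new_base=0.1)

# A constant column falls back to the midpoint of the target range.
constant = min_max_rescale([5.0, 5.0])
```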
reference:
http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #6039 from hhbyyh/minMaxNorm and squashes the following commits:
f942e9f [Yuhao Yang] add todo for metadata
8b37bbc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
4894dbc [Yuhao Yang] add copy
fa2989f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
29db415 [Yuhao Yang] add clue and minor adjustment
5b8f7cc [Yuhao Yang] style fix
9b133d0 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
22f20f2 [Yuhao Yang] style change and bug fix
747c9bb [Yuhao Yang] add ut and remove mllib version
a5ba0aa [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
585cc07 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1c6dcb1 [Yuhao Yang] minor change
0f1bc80 [Yuhao Yang] add MinMaxScaler to ml
8e7436e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
3663165 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1247c27 [Yuhao Yang] some comments improvement
d285a19 [Yuhao Yang] initial checkin for minMaxNorm
Implementation and tests for Discrete Cosine Transformer.
Author: Feynman Liang <fliang@databricks.com>
Closes #6894 from feynmanliang/dct-features and squashes the following commits:
433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
Add PCA transformer for ML pipeline
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7065 from yanboliang/spark-8664 and squashes the following commits:
4afae45 [Yanbo Liang] address comments
e9effd7 [Yanbo Liang] Add PCA transformer
In mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala, changed Javadoc-style comments to regular multiline comments to make copy-pasting R code simpler.
Author: Rosstin <asterazul@gmail.com>
Closes #7098 from Rosstin/SPARK-8661 and squashes the following commits:
5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
Converted JavaDoc style comments in mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala to regular multiline comments, to make copy-pasting R commands easier.
Author: Rosstin <asterazul@gmail.com>
Closes #7096 from Rosstin/SPARK-8660 and squashes the following commits:
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
Follow up of [SPARK-8356](https://issues.apache.org/jira/browse/SPARK-8356) and #6902.
Removes the unit test for the now deprecated ```callUdf```
Unit test in SQLQuerySuite now uses ```udf``` instead of ```callUDF```
Replaced ```callUDF``` by ```udf``` where possible in mllib
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #6993 from BenFradet/SPARK-8575 and squashes the following commits:
26f5a7a [BenFradet] 2 spaces instead of 1
1ddb452 [BenFradet] renamed initUDF in order to be consistent in OneVsRest
48ca15e [BenFradet] used vector type tag for udf call in VectorIndexer
0ebd0da [BenFradet] replace the now deprecated callUDF by udf in VectorIndexer
8013409 [BenFradet] replaced the now deprecated callUDF by udf in Predictor
94345b5 [BenFradet] unifomized udf calls in ProbabilisticClassifier
1305492 [BenFradet] uniformized udf calls in Classifier
a672228 [BenFradet] uniformized udf calls in OneVsRest
49e4904 [BenFradet] Revert "removal of the unit test for the now deprecated callUdf"
bbdeaf3 [BenFradet] fixed syntax for init udf in OneVsRest
fe2a10b [BenFradet] callUDF => udf in ProbabilisticClassifier
0ea30b3 [BenFradet] callUDF => udf in Classifier where possible
197ec82 [BenFradet] callUDF => udf in OneVsRest
84d6780 [BenFradet] modified unit test in SQLQuerySuite to use udf instead of callUDF
477709f [BenFradet] removal of the unit test for the now deprecated callUdf
Python support for Power Iteration Clustering
https://issues.apache.org/jira/browse/SPARK-5962
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:
6b03d82 [Yanbo Liang] address comments
4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
Support mining of ordered frequent item sequences.
Author: Feynman Liang <fliang@databricks.com>
Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:
7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
0d3e4b6 [Feynman Liang] Fix python test
ce987cb [Feynman Liang] Backwards compatibility aux constructor
34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
648d4d4 [Feynman Liang] Test case for frequent item sequences
252a36a [Feynman Liang] Add sequence learning flag
Add a param to disable linear feature scaling (to be implemented later in linear & logistic regression). Done as a separate PR so we can use the same param and avoid conflicts while working on the sub-tasks.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7024 from holdenk/SPARK-8522-Disable-Linear_featureScaling-Spark-8613-Add-param and squashes the following commits:
ce8931a [Holden Karau] Regenerate the sharedParams code
fa6427e [Holden Karau] update text for standardization param.
7b24a2b [Holden Karau] generate the new standardization param
3c190af [Holden Karau] Add the standardization param to sharedparamscodegen
Keep the naming conventions consistent in PythonMLLibAPI.
Only the following three functions are named differently from the others:
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So rename them to:
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
This does not affect users or public APIs; it only makes the code easier to understand for developers.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
Fix the LabeledPoint parser when there is whitespace between the label and the features vector, e.g.
(y, [x1, x2, x3])
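A whitespace-tolerant parse of this format can be sketched with a regular expression. The sketch below handles only the dense `(label, [features])` form and is a hypothetical stand-in written in Python, not the MLlib parser.

```python
import re

# Optional whitespace is allowed around the label, the comma, and the brackets.
_LABELED_POINT = re.compile(r"^\(\s*([^,\s]+)\s*,\s*\[(.*)\]\s*\)$")


def parse_labeled_point(text):
    """Parse '(y, [x1, x2, x3])' into (label, list of features)."""
    match = _LABELED_POINT.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse labeled point: {text!r}")
    label = float(match.group(1))
    body = match.group(2).strip()
    # float() ignores surrounding whitespace in each token.
    features = [float(token) for token in body.split(",")] if body else []
    return label, features


compact = parse_labeled_point("(1.0,[1.0,2.0,3.0])")
spaced = parse_labeled_point("(1.0, [1.0, 2.0, 3.0])")
```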
Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com>
Closes #6954 from fe2s/SPARK-8525 and squashes the following commits:
0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang
c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position
It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:
6182884 [MechCoder] Minor changes
8bda047 [MechCoder] Minor style fixes
0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
Author: Holden Karau <holden@pigscanfly.ca>
Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits:
0ad384c [Holden Karau] Add MiMa excludes
4016fac [Holden Karau] Switch to wild card import, remove extra blank lines
ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above
f34971c [Holden Karau] Fix some more long lines
319bd3f [Holden Karau] Fix long lines
3bb9ee1 [Holden Karau] Update the regression suite tests
7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable
0b0c8c0 [Holden Karau] fix the issue with the sample R code
e2140ba [Holden Karau] Add a test, it fails!
5e84a0b [Holden Karau] Write out thoughts and use the correct trait
91ffc0a [Holden Karau] more murh
006246c [Holden Karau] murp?
Author: Holden Karau <holden@pigscanfly.ca>
Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:
2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
Implementation of n-gram feature transformer for ML.
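For reference, n-gram extraction over a token list can be sketched in a few lines of Python. Joining tokens with a single space is an assumption made here for illustration; the empty result when `n` exceeds the input length matches the behavior named in the commit list below.

```python
def ngrams(tokens, n):
    """Return all n-grams of consecutive tokens, joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # When n > len(tokens) the range is empty, so the result is [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["a", "b", "c"], 2)
too_short = ngrams(["a"], 2)
```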
Author: Feynman Liang <fliang@databricks.com>
Closes #6887 from feynmanliang/ngram-featurizer and squashes the following commits:
d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer
Updated `Attribute.fromStructField` to allow any `NumericType`, rather than just `DoubleType`, and added unit tests for a few of the other NumericTypes.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6540 from dusenberrymw/SPARK-7426_AttributeFactory.fromStructField_Should_Allow_NumericTypes and squashes the following commits:
87fecb3 [Mike Dusenberry] Updated Attribute.fromStructField to allow any NumericType, rather than just DoubleType, and added unit tests for a few of the other NumericTypes.
Python API for PCA and PCAModel
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6315 from yanboliang/spark-7604 and squashes the following commits:
1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
JIRA: https://issues.apache.org/jira/browse/SPARK-8468
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6905 from viirya/cv_min and squashes the following commits:
930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
c3dd8d9 [Liang-Chi Hsieh] For comments.
b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.
Python bindings for StreamingKMeans
Will change status to MRG once docs, tests and examples are updated.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6499 from MechCoder/spark-4118 and squashes the following commits:
7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
Python API for org.apache.spark.mllib.feature.ElementwiseProduct
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6346 from MechCoder/spark-7605 and squashes the following commits:
79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
MatrixUDT was recently implemented in Scala. This PR ports it to PySpark.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6354 from MechCoder/spark-6390 and squashes the following commits:
fc4dc1e [MechCoder] Better error message
c940a44 [MechCoder] Added test
aa9c391 [MechCoder] Add pyUDT to MatrixUDT
62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
Check the MLlib Python classification and regression docs and make them as complete as the Scala docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6460 from yanboliang/spark-7916 and squashes the following commits:
f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
The MLUtils.appendBias method is heavily used to create intercepts for linear models.
It uses Breeze's vector concatenation, which is very slow compared to a plain
System.arraycopy. This change switches the implementation to System.arraycopy.
I saw the following performance improvements after the change:
Benchmark on the MNIST dataset, 50 runs:
MLUtils.appendBias (SparseVector Before): 47320 ms
MLUtils.appendBias (SparseVector After): 1935 ms
MLUtils.appendBias (DenseVector Before): 5340 ms
MLUtils.appendBias (DenseVector After): 4080 ms
This is roughly a 24x speedup for SparseVectors.
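The shape of the change can be sketched in Python, with lists standing in for the underlying arrays. The function names and the `(size, indices, values)` layout for sparse vectors are assumptions for illustration; the sketch shows the copy-then-append structure only and makes no performance claim.

```python
def append_bias_dense(values):
    """Return a new list with a trailing 1.0 bias term."""
    out = [0.0] * (len(values) + 1)
    out[:len(values)] = values  # one bulk copy of the existing entries
    out[-1] = 1.0
    return out


def append_bias_sparse(size, indices, values):
    """Append a bias entry at position `size` of a sparse vector."""
    # Only the stored entries are copied; the bias becomes one more active value.
    return size + 1, indices + [size], values + [1.0]


dense_with_bias = append_bias_dense([1.0, 2.0])
sparse_with_bias = append_bias_sparse(4, [1, 3], [5.0, 6.0])
```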
Author: Roger Menezes <rmenezes@netflix.com>
Closes #6768 from rogermenezes/improve-append-bias and squashes the following commits:
4e42f75 [Roger Menezes] address feedback
e999d79 [Roger Menezes] first commit
Test cases for both StreamingLinearRegression and StreamingLogisticRegression, and code fix.
Edit:
This contribution is my original work and I license the work to the project under the project's open source license.
Author: Paavo <pparkkin@gmail.com>
Closes #6713 from pparkkin/streamingmodel-empty-rdd and squashes the following commits:
ff5cd78 [Paavo] Update strings to use interpolation.
db234cf [Paavo] Use !rdd.isEmpty.
54ad89e [Paavo] Test case for empty stream.
393e36f [Paavo] Ignore empty RDDs.
0bfc365 [Paavo] Test case for empty stream.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6720 from MechCoder/empty_model_check and squashes the following commits:
3a07de5 [MechCoder] Remove construct to get weights in StreamingLinearAlgorithm
This makes the constructor callable in Python. dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #6709 from mengxr/SPARK-8168 and squashes the following commits:
f871de4 [Xiangrui Meng] Add Python friendly constructor to PipelineModel
1. Prevent creating a map of data to find numFeatures
2. If the model is empty, initialize it with a zero vector of length numFeatures
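The two points above can be sketched as lazy initialization: rather than scanning the data to discover the feature count, the weights are created as a zero vector sized from the first record seen. The class below is a hypothetical Python sketch with a simple least-squares update, not the StreamingLinearAlgorithm code.

```python
class StreamingLinearSketch:
    """Toy streaming linear model with lazily initialized weights."""

    def __init__(self, step_size=0.1):
        self.weights = None  # empty model until the first non-empty batch
        self.step_size = step_size

    def train_on_batch(self, batch):
        if not batch:
            return  # nothing to learn from an empty batch
        if self.weights is None:
            # Size the zero vector from the first record instead of a full pass.
            num_features = len(batch[0][1])
            self.weights = [0.0] * num_features
        for label, features in batch:
            prediction = sum(w * x for w, x in zip(self.weights, features))
            error = prediction - label
            self.weights = [
                w - self.step_size * error * x
                for w, x in zip(self.weights, features)
            ]


model = StreamingLinearSketch()
model.train_on_batch([])
weights_after_empty = model.weights
model.train_on_batch([(1.0, [1.0, 0.0])])
```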
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6684 from MechCoder/spark-8140 and squashes the following commits:
7fbf5f9 [MechCoder] [SPARK-8140] Remove empty model check in StreamingLinearAlgorithm And other minor cosmits
Python API for KernelDensity
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6387 from MechCoder/spark-7639 and squashes the following commits:
17abc62 [MechCoder] add tests
2de6540 [MechCoder] style tests
bf4acc0 [MechCoder] Added doctests
84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
Added stats from cross validation as a val in the cross validation model to save them for user access.
Author: leahmcguire <lmcguire@salesforce.com>
Closes #5915 from leahmcguire/saveCVmetrics and squashes the following commits:
49b507b [leahmcguire] fixed tyle error
67537b1 [leahmcguire] rebased
85907f0 [leahmcguire] fixed name
59987cc [leahmcguire] changed param name and test according to comments
36e71e3 [leahmcguire] rebasing
4b8223e [leahmcguire] fixed name
4ddffc6 [leahmcguire] changed param name and test according to comments
3a995da [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access