In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#6021 from jkbradley/revert-varargs and squashes the following commits:
098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
(cherry picked from commit 2992623841)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
1) Handle scaling and addBias internally.
2) L1/L2 elasticnet using OWLQN optimizer.
Author: DB Tsai <dbt@netflix.com>
Closes#5967 from dbtsai/lor and squashes the following commits:
fa029bb [DB Tsai] made the bound smaller
0806002 [DB Tsai] better initial intercept and more test
5c31824 [DB Tsai] fix import
c387e25 [DB Tsai] Merge branch 'master' into lor
c84e931 [DB Tsai] Made MultiClassSummarizer private
f98e711 [DB Tsai] address feedback
a784321 [DB Tsai] fix style
8ec65d2 [DB Tsai] remove new line
f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
34705bc [DB Tsai] first commit
(cherry picked from commit 86ef4cfd43)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#6015 from brkyvz/ml-rec and squashes the following commits:
be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS
(cherry picked from commit 84bf931f36)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5939 from yanboliang/spark-5913 and squashes the following commits:
cdaac99 [Yanbo Liang] Python API for ChiSqSelector
(cherry picked from commit 35c9599b94)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Implemented python wrappers for Scala functions that don't exist in `ml.features`
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5991 from brkyvz/ml-feat-PR and squashes the following commits:
adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
(cherry picked from commit f5ff4a84c4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
the toArray function of the BoundedPriorityQueue does not necessarily preserve order. Add a counter-example as the test, which would fail the original impl.
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes#5990 from coderxiang/topbykey-test and squashes the following commits:
98804c9 [Shuo Xiang] fix bug in topBykey and update test
(cherry picked from commit 92f8f803a6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
The compression is based on storage. brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes#5985 from mengxr/SPARK-6948 and squashes the following commits:
df56a00 [Xiangrui Meng] update python tests
6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler
(cherry picked from commit e43803b8f4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
https://issues.apache.org/jira/browse/SPARK-6093
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5941 from yanboliang/spark-6093 and squashes the following commits:
6934af3 [Yanbo Liang] change to @property
aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
(cherry picked from commit 1712a7c705)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5930 from brkyvz/ml-feat and squashes the following commits:
73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
(cherry picked from commit 9e2ffb1328)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.
CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5960 from jkbradley/params-cleanups and squashes the following commits:
118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
(cherry picked from commit 4f87e9562a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Small changes, primarily to allow us more flexibility in the future:
* Rename "tau_0" to "tau0"
* Mark LDAOptimizer trait sealed and DeveloperApi.
* Mark LDAOptimizer subclasses as final.
* Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
CC: hhbyyh
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5956 from jkbradley/onlinelda-cleanups and squashes the following commits:
f4be508 [Joseph K. Bradley] added newline
f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
(cherry picked from commit 8b6b46e4ff)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public. Added DeveloperApi tag. (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.
CC: mengxr shivaram etrain
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5913 from jkbradley/public-dev-api and squashes the following commits:
e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public. Organized their prediction methods for efficient computation of multiple output columns.
(cherry picked from commit 1ad04dae03)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5926 from mengxr/SPARK-6940 and squashes the following commits:
6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
(cherry picked from commit 32cdc815c6)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.
A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#5500 from sryza/sandy-spark-5888 and squashes the following commits:
f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
(cherry picked from commit 47728db7cf)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
Closes#5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
(cherry picked from commit 5995ada96b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Utilities for pickling and unpickling SparseMatrices using SerDe
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5775 from MechCoder/spark-7202 and squashes the following commits:
7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
(cherry picked from commit 5ab652cdb8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
copy(extra).fit(dataset)
}
~~~
Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~
Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).
Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs:
* [x] add tests for newly added methods
* [ ] update documentation
jkbradley dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#5820 from mengxr/SPARK-5956 and squashes the following commits:
7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
(cherry picked from commit e0833c5958)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains the implementation for [Online LDA] (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the research of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Major advantages for the algorithm are the stream compatibility and economic time/memory consumption due to the corpus split. For more details, please refer to the jira.
Online LDA can act as a fast option for LDA, and will be especially helpful for the users who needs a quick result or with large corpus.
Correctness test.
I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4419 from hhbyyh/ldaonline and squashes the following commits:
1045eec [Yuhao Yang] Merge pull request #2 from jkbradley/hhbyyh-ldaonline2
cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors. Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
6149ca6 [Yuhao Yang] fix for setOptimizer
cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
54cf8da [Yuhao Yang] some style change
68c2318 [Yuhao Yang] add a java ut
4041723 [Yuhao Yang] add ut
138bfed [Yuhao Yang] Merge pull request #1 from jkbradley/hhbyyh-ldaonline-update
9e910d9 [Joseph K. Bradley] small fix
61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
a996a82 [Yuhao Yang] respond to comments
b1178cf [Yuhao Yang] fit into the optimizer framework
dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
d19ef55 [Yuhao Yang] change OnlineLDA to class
97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e7bf3b0 [Yuhao Yang] move to seperate file
f367cc9 [Yuhao Yang] change to optimization
8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
02d0373 [Yuhao Yang] fix style in comment
f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
a570c9a [Yuhao Yang] use sample to pick up batch
4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e271eb1 [Yuhao Yang] remove non ascii
581c623 [Yuhao Yang] seperate API and adjust batch split
37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
aa365d1 [Yuhao Yang] merge upstream master
3a06526 [Yuhao Yang] merge with new example
0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
0d0f3ee [Yuhao Yang] replace random split with sliding
fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
f41c5ca [Yuhao Yang] style fix
26dca1b [Yuhao Yang] style fix and make class private
043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
d640d9c [Yuhao Yang] online lda initial checkin
This is based on #3098 from debasish83.
1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~
Closes#3098
debasish83 coderxiang
Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5829 from mengxr/SPARK-3066 and squashes the following commits:
22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse
Author: DB Tsai <dbt@netflix.com>
Closes#5809 from dbtsai/format and squashes the following commits:
6904eed [DB Tsai] triger jenkins
9146e19 [DB Tsai] initial commit
See PDF attached to the JIRA issue 1406.
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
Author: Xiangrui Meng <meng@databricks.com>
Author: selvinsource <vselvaggio@hotmail.it>
Closes#3062 from selvinsource/mllib_pmml_model_export_SPARK-1406 and squashes the following commits:
852aac6 [Vincenzo Selvaggio] [SPARK-1406] Update JPMML version to 1.1.15 in LICENSE file
085cf42 [Vincenzo Selvaggio] [SPARK-1406] Added Double Min and Max Fixed scala style
30165c4 [Vincenzo Selvaggio] [SPARK-1406] Fixed extreme cases for logit
7a5e0ec [Vincenzo Selvaggio] [SPARK-1406] Binary classification for SVM and Logistic Regression
cfcb596 [Vincenzo Selvaggio] [SPARK-1406] Throw IllegalArgumentException when exporting a multinomial logistic regression
25dce33 [Vincenzo Selvaggio] [SPARK-1406] Update code to latest pmml model
dea98ca [Vincenzo Selvaggio] [SPARK-1406] Exclude transitive dependency for pmml model
66b7c12 [Vincenzo Selvaggio] [SPARK-1406] Updated pmml model lib to 1.1.15, latest Java 6 compatible
a0a55f7 [Vincenzo Selvaggio] Merge pull request #2 from mengxr/SPARK-1406
3c22f79 [Xiangrui Meng] more code style
e2313df [Vincenzo Selvaggio] Merge pull request #1 from mengxr/SPARK-1406
472d757 [Xiangrui Meng] fix code style
1676e15 [Vincenzo Selvaggio] fixed scala issue
e2ffae8 [Vincenzo Selvaggio] fixed scala style
b8823b0 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
b25bbf7 [Vincenzo Selvaggio] [SPARK-1406] Added export of pmml to distributed file system using the spark context
7a949d0 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
f46c75c [Vincenzo Selvaggio] [SPARK-1406] Added PMMLExportable to supported models
7b33b4e [Vincenzo Selvaggio] [SPARK-1406] Added a PMMLExportable interface Restructured code in a new package mllib.pmml Supported models implements the new PMMLExportable interface: LogisticRegression, SVM, KMeansModel, LinearRegression, RidgeRegression, Lasso
d559ec5 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
8fe12bb [Vincenzo Selvaggio] [SPARK-1406] Adjusted logistic regression export description and target categories
03bc3a5 [Vincenzo Selvaggio] added logistic regression
da2ec11 [Vincenzo Selvaggio] [SPARK-1406] added linear SVM PMML export
82f2131 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
19adf29 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
1faf985 [Vincenzo Selvaggio] [SPARK-1406] Added target field to the regression model for completeness Adjusted unit test to deal with this change
3ae8ae5 [Vincenzo Selvaggio] [SPARK-1406] Adjusted imported order according to the guidelines
c67ce81 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
78515ec [Vincenzo Selvaggio] [SPARK-1406] added pmml export for LinearRegressionModel, RidgeRegressionModel and LassoModel
e29dfb9 [Vincenzo Selvaggio] removed version, by default is set to 4.2 (latest from jpmml) removed copyright
ae8b993 [Vincenzo Selvaggio] updated some commented tests to use the new ModelExporter object reordered the imports
df8a89e [Vincenzo Selvaggio] added pmml version to pmml model changed the copyright to spark
a1b4dc3 [Vincenzo Selvaggio] updated imports
834ca44 [Vincenzo Selvaggio] reordered the import accordingly to the guidelines
349a76b [Vincenzo Selvaggio] new helper object to serialize the models to pmml format
c3ef9b8 [Vincenzo Selvaggio] set it to private
6357b98 [Vincenzo Selvaggio] set it to private
e1eb251 [Vincenzo Selvaggio] removed serialization part, this will be part of the ModelExporter helper object
aba5ee1 [Vincenzo Selvaggio] fixed cluster export
cd6c07c [Vincenzo Selvaggio] fixed scala style to run tests
f75b988 [Vincenzo Selvaggio] Merge remote-tracking branch 'origin/master' into mllib_pmml_model_export_SPARK-1406
07a29bf [selvinsource] Update LICENSE
8841439 [Vincenzo Selvaggio] adjust scala style in order to compile
1433b11 [Vincenzo Selvaggio] complete suite tests
8e71b8d [Vincenzo Selvaggio] kmeans pmml export implementation
9bc494f [Vincenzo Selvaggio] added scala suite tests added saveLocalFile to ModelExport trait
226e184 [Vincenzo Selvaggio] added javadoc and export model type in case there is a need to support other types of export (not just PMML)
a0e3679 [Vincenzo Selvaggio] export and pmml export traits kmeans test implementation
Author: DB Tsai <dbt@netflix.com>
Closes#5794 from dbtsai/clean and squashes the following commits:
ad639dd [DB Tsai] Indentation
834d527 [DB Tsai] Some code clean up.
Main change: Added isValid field to Param. Modified all usages to use isValid when relevant. Added helper methods in ParamValidate.
Also overrode Params.validate() in:
* CrossValidator + model
* Pipeline + model
I made a few updates for the elastic net patch:
* I changed "tol" to "convergenceTol"
* I added some documentation
This PR is Scala + Java only. Python will be in a follow-up PR.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5740 from jkbradley/enforce-validate and squashes the following commits:
ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg. Fixed test failures. Renamed ParamValidate to ParamValidators. Removed explicit type from ParamValidators calls where possible.
bb2665a [Joseph K. Bradley] merged with elastic net pr
ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
6895dfc [Joseph K. Bradley] small cleanups
069ac6d [Joseph K. Bradley] many cleanups
928fb84 [Joseph K. Bradley] Maybe done
a910ac7 [Joseph K. Bradley] still workin
6d60e2e [Joseph K. Bradley] Still workin
b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
dbc9fb2 [Joseph K. Bradley] merged with master. enforcing Params.validate
Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column. Removed ml.util.TestingUtils since VectorIndexer was the only use.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5789 from jkbradley/vector-indexer-metadata and squashes the following commits:
b28e159 [Joseph K. Bradley] Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column. Removed ml.util.TestingUtils since VectorIndexer was the only use.
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).
There are some notes:
1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms.
2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues.
3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5596 from yinxusen/SPARK-6529 and squashes the following commits:
ee2b37a [Xusen Yin] merge with former HEAD
4945462 [Xusen Yin] merge with #5626
3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
5dd4ee7 [Xusen Yin] fix scala style
743e0d5 [Xusen Yin] fix comments and code style
04c48e9 [Xusen Yin] ensure the functionality
a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
02848fa [Xusen Yin] refine comments
34a55c0 [Xusen Yin] fix errors
109d124 [Xusen Yin] add test suite and pass it
04dde06 [Xusen Yin] add shared params
c594095 [Xusen Yin] add word2vec transformer
23d77fa [Xusen Yin] merge with #5626
e8cfaf7 [Xusen Yin] fix conflict with master
66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
566ec20 [Xusen Yin] fix scala style
b54399f [Xusen Yin] fix comments and code style
1211e86 [Xusen Yin] ensure the functionality
6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
7cde18f [Xusen Yin] rm sharedParams
618abd0 [Xusen Yin] refine comments
e29680a [Xusen Yin] fix errors
fe3afe9 [Xusen Yin] add test suite and pass it
02767fb [Xusen Yin] add shared params
6a514f1 [Xusen Yin] add word2vec transformer
Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Refactored the code so the model is compressed based on the storage. We may try compression based on the prediction time.
Also, I found that diffSum will be always zero mathematically, so no corrections are required.
Author: DB Tsai <dbt@netflix.com>
Closes#5767 from dbtsai/lir-doc and squashes the following commits:
5e346c9 [DB Tsai] refactoring
fc9f582 [DB Tsai] doc
58456d8 [DB Tsai] address feedback
69757b8 [DB Tsai] actually diffSum is mathematically zero! No correction is needed.
5929e49 [DB Tsai] typo
63f7d1e [DB Tsai] Added compression to the model based on storage
203a295 [DB Tsai] Add more documentation to LinearRegression in new ML framework.
mengxr
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5769 from yinxusen/patch-1 and squashes the following commits:
43235f4 [Xusen Yin] Update PearsonCorrelation.scala
f7287ee [Xusen Yin] Fix a typo of "threshold"
Add `compressed` to `Vector` with some other methods: `numActives`, `numNonzeros`, `toSparse`, and `toDense`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5756 from mengxr/SPARK-6756 and squashes the following commits:
8d4ecbd [Xiangrui Meng] address comment and add mima excludes
da54179 [Xiangrui Meng] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
Cast numeric types to String for indexing. Boolean type is not handled in this PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5753 from mengxr/SPARK-6965 and squashes the following commits:
2e34f3c [Xiangrui Meng] add actual type in the error message
ad938bf [Xiangrui Meng] StringIndexer handles numeric input.
It shouldn't live directly under `spark.ml`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#5749 from mengxr/SPARK-7201 and squashes the following commits:
53847f9 [Xiangrui Meng] move Identifiable to ml.util
The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5697 from mengxr/SPARK-7140 and squashes the following commits:
2abc86d [Xiangrui Meng] typo
8fb7d74 [Xiangrui Meng] update impl
1ebad60 [Xiangrui Meng] only scan the first 16 nonzeros in Vector.hashCode
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#4259 from dbtsai/lir and squashes the following commits:
a81c201 [DB Tsai] add import org.apache.spark.util.Utils back
9fc48ed [DB Tsai] rebase
2178b63 [DB Tsai] add comments
9988ca8 [DB Tsai] addressed feedback and fixed a bug. TODO: documentation and build another synthetic dataset which can catch the bug fixed in this commit.
fcbaefe [DB Tsai] Refactoring
4eb078d [DB Tsai] first commit
jira: https://issues.apache.org/jira/browse/SPARK-7090
LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.
Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common iterface for concrete implementations. Each subClass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to file LDAOptimizer and inherits from LDAOptimizer, rename to EMLDAOptimizer. (in case a more generic EMOptimizer comes in the future)
-adjust the constructor of EMOptimizer, since all the parameters should be passed in through initialState method. This can avoid unwanted confusion or overwrite.
-move the code from LDA.initalState to initalState of EMLDAOptimizer
3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5661 from hhbyyh/ldaRefactor and squashes the following commits:
0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
1. predict(predict.toString) has already output prefix “predict” thus it’s duplicated to print ", predict = " again
2. there are some extra spaces
Author: Alain <aihe@usc.edu>
Closes#5687 from AiHe/tree-node-issue-2 and squashes the following commits:
9862b9a [Alain] Pass scala coding style checking
44ba947 [Alain] Minor][MLLIB] Format toString method in MLLIB
bdc402f [Alain] [Minor][MLLIB] Fix a formatting bug in toString method in Node
426eee7 [Alain] [Minor][MLLIB] Fix a formatting bug in toString method in Node.scala
This is a continuation of [https://github.com/apache/spark/pull/5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.
This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.
Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.
CC: mengxr manishamde codedeft chouqin
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5626 from jkbradley/dt-api-ensembles and squashes the following commits:
729167a [Joseph K. Bradley] small cleanups based on code review
bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
855aa9a [Joseph K. Bradley] scala style fix
ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml. Not tested yet. Need to add example as well
d045ebd [Joseph K. Bradley] some more updates, but far from done
ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.
See [SPARK-6528](https://issues.apache.org/jira/browse/SPARK-6528). Add IDF transformer in ML package.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5266 from yinxusen/SPARK-6528 and squashes the following commits:
741db31 [Xusen Yin] get param from new paramMap
d169967 [Xusen Yin] add final to param and IDF class
c9c3759 [Xusen Yin] simplify test suite
5867c09 [Xusen Yin] refine IDF transformer with new interfaces
7727cae [Xusen Yin] Merge branch 'master' into SPARK-6528
4338a37 [Xusen Yin] Merge branch 'master' into SPARK-6528
aef2cdf [Xusen Yin] add doc and group for param
5760b49 [Xusen Yin] fix code style
2add691 [Xusen Yin] fix code style and test
03fbecb [Xusen Yin] remove duplicated code
2aa4be0 [Xusen Yin] clean test suite
4802c67 [Xusen Yin] add IDF transformer and test suite
yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5681 from mengxr/SPARK-7115 and squashes the following commits:
9ac27cd [Xiangrui Meng] skip the very first 1 in poly expansion
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5649 from mengxr/SPARK-7070 and squashes the following commits:
c66023c [Xiangrui Meng] setBeta should call setTopicConcentration
Author: wizz <wizz@wizz-dev01.kawasaki.flab.fujitsu.com>
Closes#5658 from kuromatsu-nobuyuki/SPARK-7085 and squashes the following commits:
6ec2d21 [wizz] Fix miniBatchFraction parameter in train method called with 4 arguments
Author: Reynold Xin <rxin@databricks.com>
Closes#5648 from rxin/vectorAssembler-boolean and squashes the following commits:
1bf3d40 [Reynold Xin] [MLlib] Add support for BooleanType to VectorAssembler.
Author: Reynold Xin <rxin@databricks.com>
Closes#5642 from rxin/mllib-native-type and squashes the following commits:
e23af5b [Reynold Xin] Remove StringType
7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.
Author: Reynold Xin <rxin@databricks.com>
Closes#5644 from rxin/mllib-nullable and squashes the following commits:
a727e5b [Reynold Xin] [MLlib] UnaryTransformer nullability should not depend on primitive types.
This does a few clean-ups. With this PR, all spark.ml tree components have ```private[ml]``` constructors.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5567 from jkbradley/dt-api-dt2 and squashes the following commits:
2263b5b [Joseph K. Bradley] Added note about tree example issue.
bb9f610 [Joseph K. Bradley] Small cleanups after original tree API PR
add missing comma and space
Author: Alain <aihe@usc.edu>
Closes#5621 from AiHe/tree-node-issue and squashes the following commits:
159a7bb [Alain] [Minor][MLLIB] Fix a minor formatting bug in toString methods in Node.scala
(cherry picked from commit 4508f01890a723f80d631424ff8eda166a13a727)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
1. Use blas calls to find the dot product between two vectors.
2. Prevent re-computing the L2 norm of the given vector for each word in model.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5467 from MechCoder/spark-6065 and squashes the following commits:
dd0b0b2 [MechCoder] Preallocate wordVectors
ffc9240 [MechCoder] Minor
6b74c81 [MechCoder] Switch back to native blas calls
da1642d [MechCoder] Explicit types and indexing
64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix
fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls
Since sparse matrices now support a isTransposed flag for row major data, DenseMatrices should do the same.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5455 from MechCoder/spark-6845 and squashes the following commits:
525c370 [MechCoder] minor
004a37f [MechCoder] Cast boolean to int
151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
Model import/export for IsotonicRegression
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5270 from yanboliang/spark-5990 and squashes the following commits:
872028d [Yanbo Liang] fix code style
f80ec1b [Yanbo Liang] address comments
49600cc [Yanbo Liang] address comments
429ff7d [Yanbo Liang] store each interval as a record
2b2f5a1 [Yanbo Liang] Model import/export for IsotonicRegression
The current implementation call the default constructor of mllib.feature.StandarScaler without the possibility to specify withMean or withStd options.
Author: jrabary <Jaonary@gmail.com>
Closes#4704 from jrabary/master and squashes the following commits:
fae8568 [jrabary] style fix
8896b0e [jrabary] Comments fix
ef96d73 [jrabary] style fix
8e52607 [jrabary] style fix
edd9d48 [jrabary] Fix default param initialization
17e1a76 [jrabary] Fix default param initialization
298f405 [jrabary] Typo fix
45ed914 [jrabary] Add withMean and withStd params to StandarScaler
Title says it all.
cc rxin tdas
Author: zsxwing <zsxwing@gmail.com>
Closes#5554 from zsxwing/SPARK-6979 and squashes the following commits:
5304350 [zsxwing] Fix NotSerializableException
e9d3479 [zsxwing] Add blank lines
633e279 [zsxwing] Fix NotSerializableException
e496ace [zsxwing] Replace JobGenerator.eventActor with EventLoop
ec6ec58 [zsxwing] Fix the import order
ce0fa73 [zsxwing] Replace JobScheduler.eventActor with EventLoop
If `StreamingKMeans` is not `Serializable`, we cannot do checkpoint for applications that using `StreamingKMeans`. So we should make it `Serializable`.
Author: zsxwing <zsxwing@gmail.com>
Closes#5582 from zsxwing/SPARK-6998 and squashes the following commits:
67c2a14 [zsxwing] Make StreamingKMeans 'Serializable'
This is a PR for cleaning up and finalizing the DecisionTree API. PRs for ensembles will follow once this is merged.
### Goal
Here is the description copied from the JIRA (for both trees and ensembles):
> **Issue**: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
> **Proposal**: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
> **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)** : This outlines current issues and the proposed API.
Overall code layout:
* The old API in mllib.tree.* will remain the same.
* The new API will reside in ml.classification.* and ml.regression.*
### Summary of changes
Old API
* Exactly the same, except I made 1 method in Loss private (but that is not a breaking change since that method was introduced after the Spark 1.3 release).
New APIs
* Under Pipeline API
* The new API preserves functionality, except:
* New API does NOT store prob (probability of label in classification). I want to have it store the full vector of probabilities but feel that should be in a later PR.
* Use abstractions for parameters, estimators, and models to avoid code duplication
* Limit parameters to relevant algorithms
* For enum-like types, only expose Strings
* We can make these pluggable later on by adding new parameters. That is a far-future item.
Test suites
* I organized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
* The test suites for the new API only test (a) similarity with the results of the old API and (b) elements of the new API.
* After code is moved to this new API, we should move the tests from the old suites which test the internals.
### Details
#### Changed names
Parameters
* useNodeIdCache -> cacheNodeIds
#### Other changes
* Split: Changed categories to set instead of list
#### Non-decision tree changes
* AttributeGroup
* Added parentheses to toMetadata, toStructField methods (These were removed in a previous PR, but I ran into 1 issue with the Scala compiler not being able to disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
* Attributes
* Renamed: toMetadata -> toMetadataImpl
* Added toMetadata methods which return ML metadata (keyed with “ML_ATTR”)
* NominalAttribute: Added getNumValues method which examines both numValues and values.
* Params.inheritValues: Checks whether the parent param really belongs to the child (to allow Estimator-Model pairs with different sets of parameters)
### Questions for reviewers
* Is "DecisionTreeClassificationModel" too long a name?
* Is this OK in the docs?
```
class DecisionTreeRegressor extends TreeRegressor[DecisionTreeRegressionModel] with DecisionTreeParams[DecisionTreeRegressor] with TreeRegressorParams[DecisionTreeRegressor]
```
### Future
We should open up the abstractions at some point. E.g., it would be useful to be able to set tree-related parameters in 1 place and then pass those to multiple tree-based algorithms.
Follow-up JIRAs will be (in this order):
* Tree ensembles
* Deprecate old tree code
* Move DecisionTree implementation code to new API.
* Move tests from the old suites which test the internals.
* Update programming guide
* Python API
* Change RandomForest* to always use bootstrapping, even when numTrees = 1
* Provide the probability of the predicted label for classification. After we move code to the new API and update it to maintain probabilities for all labels, then we can add the probabilities to the new API.
CC: mengxr manishamde codedeft chouqin MechCoder
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5530 from jkbradley/dt-api-dt and squashes the following commits:
6aae255 [Joseph K. Bradley] Changed tree abstractions not to take type parameters, and for setters to return this.type instead
ec17947 [Joseph K. Bradley] Updates based on code review. Main changes were: moving public types from ml.impl.tree to ml.tree, modifying CategoricalSplit to take an Array of categories but store a Set internally, making more types sealed or final
5626c81 [Joseph K. Bradley] style fixes
f8fbd24 [Joseph K. Bradley] imported reorg of DecisionTreeSuite from old PR. small cleanups
7ef63ed [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example (for real this time)
e11673f [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example
119f407 [Joseph K. Bradley] added DecisionTreeClassifier example
0bdc486 [Joseph K. Bradley] fixed issues after param PR was merged
f9fbb60 [Joseph K. Bradley] Done with DecisionTreeClassifier, but no save/load yet. Need to add example as well
2532c9a [Joseph K. Bradley] partial move to spark.ml API, not done yet
c72c1a0 [Joseph K. Bradley] Copied changes for common items, plus DecisionTreeClassifier from original PR
This PR update PySpark to support Python 3 (tested with 3.4).
Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
TODO: ec2/spark-ec2.py is not fully tested with python3.
Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5173 from davies/python3 and squashes the following commits:
d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
Same as #5431 but for Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5534 from mengxr/SPARK-6893 and squashes the following commits:
d3b519b [Xiangrui Meng] address comments
ebaccc6 [Xiangrui Meng] style update
fce244e [Xiangrui Meng] update explainParams with test
4d6b07a [Xiangrui Meng] add tests
5294500 [Xiangrui Meng] update default param handling in python
This pr adds informative error messages to all require statements in the Vectors class that did not previously have them. This references [SPARK-6938](https://issues.apache.org/jira/browse/SPARK-6938).
Author: Juliet Hougland <juliet@cloudera.com>
Closes#5532 from jhlch/SPARK-6938 and squashes the following commits:
ab321bb [Juliet Hougland] Remove braces from string interpolation when not required.
1221f94 [Juliet Hougland] All require statements now have an informative error message.
The previous PR https://github.com/apache/spark/pull/4906 helped to extract the learning curve giving the error for each iteration. This continues the work refactoring some code and extending the same logic during training and validation.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5330 from MechCoder/spark-5972 and squashes the following commits:
0b5d659 [MechCoder] minor
32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation
See JIRA issue [SPARK-5988](https://issues.apache.org/jira/browse/SPARK-5988).
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5450 from yinxusen/SPARK-5988 and squashes the following commits:
cb1ecfa [Xusen Yin] change Assignment into case class
b1dd24c [Xusen Yin] add test suite
63c3923 [Xusen Yin] add save load for power iteration clustering
This PR adds string indexer, which takes a column of string labels and outputs a double column with labels indexed by their frequency.
TODOs:
- [x] store feature to index map in output metadata
Author: Xiangrui Meng <meng@databricks.com>
Closes#4735 from mengxr/SPARK-5886 and squashes the following commits:
d82575f [Xiangrui Meng] fix test
700e70f [Xiangrui Meng] rename LabelIndexer to StringIndexer
16a6f8c [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
457166e [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
f8b30f4 [Xiangrui Meng] update label indexer to output metadata
e81ec28 [Xiangrui Meng] Merge branch 'openhashmap-contains' into SPARK-5886-2
d6e6f1f [Xiangrui Meng] add contains to primitivekeyopenhashmap
748a69b [Xiangrui Meng] add contains to OpenHashMap
def3c5c [Xiangrui Meng] add LabelIndexer
**Ready for review!**
Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.
This introduces a VectorIndexer class which does the following:
* VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
* Feature which exceed maxCategories are declared continuous, and the Model will treat them as such.
* VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices
Design notes:
* This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
* This does not yet support transforming data with new (unknown) categorical feature values. That can be added later.
* This is necessary for DecisionTree and tree ensembles.
Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.
Other notes:
* This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3000 from jkbradley/indexer and squashes the following commits:
5956d91 [Joseph K. Bradley] minor cleanups
f5c57a8 [Joseph K. Bradley] added Java test suite
643b444 [Joseph K. Bradley] removed FeatureTests
02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
5e7c874 [Joseph K. Bradley] working on DatasetIndexer
This is the sub-task of SPARK-6254.
Wrap missing method for `StandardScalerModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5310 from Lewuathe/SPARK-6643 and squashes the following commits:
fafd690 [lewuathe] Fix for lint-python
bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
578f5ee [lewuathe] Remove unnecessary class
a38f155 [lewuathe] Merge master
66bb2ab [lewuathe] Fix typos
82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
cc marmbrus
Author: Volodymyr Lyubinets <vlyubin@gmail.com>
Closes#5279 from vlyubin/speedup and squashes the following commits:
e75a387 [Volodymyr Lyubinets] Changes to ScalaUDF
11a20ec [Volodymyr Lyubinets] Avoid creating a tuple
c327bc9 [Volodymyr Lyubinets] Moved the only remaining function from DataTypeConversions to DateUtils
dec6802 [Volodymyr Lyubinets] Addresed review feedback
74301fa [Volodymyr Lyubinets] Addressed review comments
afa3aa5 [Volodymyr Lyubinets] Minor refactoring, added license, removed debug output
881dc60 [Volodymyr Lyubinets] Moved to a separate module; addressed review comments; one extra place of usage; changed behaviour for Java
8cad6e2 [Volodymyr Lyubinets] Addressed review commments
41b2aa9 [Volodymyr Lyubinets] Creating converters for ScalaReflection stuff, and more
jira: https://issues.apache.org/jira/browse/SPARK-6693
It's kind of annoying when debugging and found you cannot print out the matrix as you want.
original toString of Matrix only print like following,
0.17810102596909183 0.5616906241468385 ... (10 total)
0.9692861997823815 0.015558159784155756 ...
0.8513015122819192 0.031523763918528847 ...
0.5396875653953941 0.3267864552779176 ...
The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5344 from hhbyyh/addToString and squashes the following commits:
19a6836 [Yuhao Yang] remove extra line
6314b21 [Yuhao Yang] add exclude
736c324 [Yuhao Yang] add ut and exclude
420da39 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into addToString
c22f352 [Yuhao Yang] style change
64a9e0f [Yuhao Yang] add specific to string to matrix
Support FPGrowth algorithm in Python API.
Should we remove "Experimental" which were marked for FPGrowth and FPGrowthModel in Scala? jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5213 from yanboliang/spark-6264 and squashes the following commits:
ed62ead [Yanbo Liang] trigger jenkins
8ce0359 [Yanbo Liang] fix docstring style
544c725 [Yanbo Liang] address comments
a2d7cf7 [Yanbo Liang] add doc for FPGrowth.train()
dcf7d73 [Yanbo Liang] add python doc
b18fd07 [Yanbo Liang] trigger jenkins
2c951b8 [Yanbo Liang] fix typos
7f62c8f [Yanbo Liang] add fpm to __init__.py
b96206a [Yanbo Liang] Support FPGrowth algorithm in Python API
https://issues.apache.org/jira/browse/SPARK-6758
I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes#5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:
e09605b [WangTaoTheTonic] block the right jetty package
So we can turn style checker on for test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#5411 from rxin/test-style-mllib and squashes the following commits:
d8a2569 [Reynold Xin] [SPARK-6765] Fix test code style for mllib.
I have the fit intercept enabled by default for logistic regression, I
wonder what others think here. I understand that it enables allocation
by default which is undesirable, but one needs to have a very strong
reason for not having an intercept term enabled so it is the safer
default from a statistical sense.
Explicitly modeling the intercept by adding a column of all 1s does not
work. I believe the reason is that since the API for
LogisticRegressionWithLBFGS forces column normalization, and a column of all
1s has 0 variance so dividing by 0 kills it.
Author: Omede Firouz <ofirouz@palantir.com>
Closes#5301 from oefirouz/addIntercept and squashes the following commits:
9f1286b [Omede Firouz] [SPARK-6705][MLLIB] Add fitInterceptTerm to LogisticRegression
1d6bd6f [Omede Firouz] [SPARK-6705][MLLIB] Add a fit intercept term to ML LogisticRegression
9963509 [Omede Firouz] [MLLIB] Add fitIntercept to LogisticRegression
2257fca [Omede Firouz] [MLLIB] Add fitIntercept param to logistic regression
329c1e2 [Omede Firouz] [MLLIB] Add fit intercept term
bd9663c [Omede Firouz] [MLLIB] Add fit intercept api to ml logisticregression
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5384 from vinodkc/Suppression_Scala_existential_code and squashes the following commits:
82a3a1f [Vinod K C] Added scala.language.existentials
Use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This could happen in a dense and large column - this way Spark can spill the pairs onto disk instead of building all the pairs before handing them to Spark.
Another PR coming to update documentation.
Author: Reza Zadeh <reza@databricks.com>
Closes#5364 from rezazadeh/optmemsim and squashes the following commits:
47c90ba [Reza Zadeh] Iterators in columnSimilarities for flatMap
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5296 from Lewuathe/SPARK-6615 and squashes the following commits:
f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
There's no corresponding printing in linear regression. Here was my previous PR (something weird happened and I can't reopen it) https://github.com/apache/spark/pull/5272
Author: Omede Firouz <ofirouz@palantir.com>
Closes#5338 from oefirouz/println and squashes the following commits:
3f3dbf4 [Omede Firouz] [MLLIB] Remove println
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.
Author: Reynold Xin <rxin@databricks.com>
Closes#5342 from rxin/SPARK-6428 and squashes the following commits:
7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
This patch fixes a reported bug causing model updates to not properly propagate to model predictions during streaming regression. These minor changes in model declaration fix the problem, and I expanded the tests to include the scenario in which the bug was arising. The two new tests failed prior to the patch and now pass.
cc mengxr
Author: freeman <the.freeman.lab@gmail.com>
Closes#5037 from freeman-lab/train-predict-fix and squashes the following commits:
3af953e [freeman] Expand test coverage to include combined training and prediction
8f84fc8 [freeman] Move model declaration
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#5318 from mengxr/SPARK-6660 and squashes the following commits:
0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5314 from mengxr/SPARK-6642 and squashes the following commits:
dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
Word2Vec model now supports saving and loading.
a] The Metadata stored in JSON format consists of "version", "classname", "vectorSize" and "numWords"
b] The data stored in Parquet file format consists of an Array of rows with each row consisting of 2 columns, first being the word: String and the second, an Array of Floats.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5291 from MechCoder/spark-5692 and squashes the following commits:
1142f3a [MechCoder] Add numWords to metaData
bfe4c39 [MechCoder] [SPARK-5692] Word2Vec save/load
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
```scala
LogisticRegressionWithLBFGS
setNumClasses
setValidateData
LogisticRegressionModel
getThreshold
numClasses
numFeatures
SVMWithSGD
setValidateData
SVMModel
getThreshold
```
For users the greatest benefit in this PR is multiclass classification was supported by Python API.
Users can train multiclass classification model and use it to predict in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5137 from yanboliang/spark-6255 and squashes the following commits:
0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli.
When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.
Default for model is original Multinomial fit and predict.
Added additional testing for Bernoulli and Multinomial models.
Author: leahmcguire <lmcguire@salesforce.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Leah McGuire <lmcguire@salesforce.com>
Closes#4087 from leahmcguire/master and squashes the following commits:
f3c8994 [leahmcguire] changed checks on model type to requires
acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
9ad89ca [Joseph K. Bradley] removed old code
6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added. Updated tests. Also updated ModelType enum-like type.
852a727 [leahmcguire] merged with upstream master
a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
bea62af [leahmcguire] put back in constructor for NaiveBayes
01baad7 [leahmcguire] made fixes from code review
fb0a5c7 [leahmcguire] removed typo
e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
2d0c1ba [leahmcguire] fixed typo in NaiveBayes
c298e78 [leahmcguire] fixed scala style errors
b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
900b586 [leahmcguire] fixed model call so that uses type argument
ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
e016569 [leahmcguire] updated test suite with model type fix
85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
dc65374 [leahmcguire] integrated model type fix
7622b0c [leahmcguire] added comments and fixed style as per rb
b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
76e5b0f [leahmcguire] removed unnecessary sort from test
0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
This fixes `predictAll` after load. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5243 from mengxr/SPARK-6571 and squashes the following commits:
82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
mengxr Should we add test suite for this transformer? There is no test suite for all feature transformers in ML package now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5181 from yinxusen/SPARK-6526 and squashes the following commits:
6faa7bf [Xusen Yin] fix style
8a462da [Xusen Yin] remove duplications
ab35ab0 [Xusen Yin] add test suite
bc8cd0f [Xusen Yin] fix comment
79774c9 [Xusen Yin] add Normalizer transformer in ML package
There are any bugs of breeze's SparseVector at 0.11.1. You know, Spark 1.3 depends on breeze 0.11.1. So I think we should upgrade it to 0.11.2.
https://issues.apache.org/jira/browse/SPARK-6341
And thanks you for your great cooperation, David Hall(dlwh)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#5222 from yu-iskw/upgrade-breeze and squashes the following commits:
ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
minor thing. Let me know if jira is required.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5207 from hhbyyh/adjustImport and squashes the following commits:
2240121 [Yuhao Yang] remove unused import
Should be self explanatory.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4986 from MechCoder/spark-5987 and squashes the following commits:
7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
setValidateData
LassoWithSGD
setIntercept
setValidateData
RidgeRegressionWithSGD
setIntercept
setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4997 from yanboliang/spark-6256 and squashes the following commits:
102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.
Author: Augustin Borsu <augustin@sagacify.com>
Author: Augustin Borsu <a.borsu@gmail.com>
Author: Augustin Borsu <aborsu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4504 from aborsu985/master and squashes the following commits:
716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
a164800 [Xiangrui Meng] remove tabs
556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
9651aec [Xiangrui Meng] update test
f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
148126f [Augustin Borsu] Added return type to public functions
12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
6a85982 [Augustin Borsu] Style corrections
38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
e262bac [Augustin Borsu] Added unit tests in scala
cd6642e [Augustin Borsu] Changed regex to pattern
132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9f8685a [Augustin Borsu] RegexTokenizer
9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9547e9d [Augustin Borsu] RegEx Tokenizer
01cd26f [Augustin Borsu] RegExTokenizer
In GeneralizedLinearAlgorithm ```numFeatures``` is default to -1, we need to update it to correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train multiclass classification model, it will throw exception due to the numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5167 from yanboliang/spark-6496 and squashes the following commits:
8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5118 from MechCoder/spark-6308 and squashes the following commits:
6c8ffab [MechCoder] Add test for simpleString
b966242 [MechCoder] [SPARK-6308] [MLlib][Sql] VectorUDT is displayed as vecto in dtypes
Author: vinodkc <vinod.kc.in@gmail.com>
Closes#5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits:
2c6aee6 [vinodkc] Spark 1.3 doc fixes
Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4906 from MechCoder/spark-6025 and squashes the following commits:
67146ab [MechCoder] Minor
352001f [MechCoder] Minor
6e8aa10 [MechCoder] Made the following changes Used mapPartition instead of map Refactored computeError and unpersisted broadcast variables
bc99ac6 [MechCoder] Refactor the method and stuff
dbda033 [MechCoder] [SPARK-6025] Add helper method evaluateEachIteration to extract learning curve
Utilities to serialize and deserialize Matrices in MLlib
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5048 from MechCoder/spark-6309 and squashes the following commits:
05dc6f2 [MechCoder] Hashcode and organize imports
16d5d47 [MechCoder] Test some more
6e67020 [MechCoder] TST: Test using Array conversion instead of equals
7fa7a2c [MechCoder] [SPARK-6309] [SQL] [MLlib] Implement MatrixUDT
Add checkpiontInterval to ALS to prevent:
1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some node fail.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5076 from mengxr/SPARK-5955 and squashes the following commits:
df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS
This PR implements two functions
- `topByKey(num: Int): RDD[(K, Array[V])]` finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.
- `takeOrderedByKey(num: Int): RDD[(K, Array[V])] ` does the opposite of `topByKey`
The `sorted` is used here as the `toArray` method of the PriorityQueue does not return a necessarily sorted array.
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes#5075 from coderxiang/topByKey and squashes the following commits:
1611c37 [Shuo Xiang] code clean up
6f565c0 [Shuo Xiang] naming
a80e0ec [Shuo Xiang] typo and warning
82dded9 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
d202745 [Shuo Xiang] move to MLPairRDDFunctions
901b0af [Shuo Xiang] style check
70c6e35 [Shuo Xiang] remove takeOrderedByKey, update doc and test
0895c17 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
b10e325 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
debccad [Shuo Xiang] topByKey
I want to add a checker to turn public type checking on, since future pull requests can accidentally expose a non-public type. This is the first cleanup task.
Author: Reynold Xin <rxin@databricks.com>
Closes#5102 from rxin/mllib-hashcode-publicmethodtypes and squashes the following commits:
617f19e [Reynold Xin] Fixed Scala compilation error.
52bc2d5 [Reynold Xin] [MLlib] Added explicit type for public methods and implemented hashCode when equals is defined.
GLM toString prints out intercept, numFeatures.
For LogisticRegression and SVM model, toString also prints out numClasses, threshold.
GLM toDebugString prints out the whole weights, intercept.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5038 from yanboliang/spark-6291 and squashes the following commits:
2f578b0 [Yanbo Liang] code format
78b33f2 [Yanbo Liang] fix typos
1e8a023 [Yanbo Liang] GLM toString & toDebugString
I find it's better to have getter for NumFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage, otherwise I 'll have to get the value through debug.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5058 from hhbyyh/addGetLinear and squashes the following commits:
9dc90e8 [Yuhao Yang] add get for GeneralizedLinearAlgo
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
LBFGS uses convergence tolerance. This value should be written in document as an argument.
Author: lewuathe <lewuathe@me.com>
Closes#5033 from Lewuathe/SPARK-6336 and squashes the following commits:
e738b33 [lewuathe] Modify text to be more natural
ac03c3a [lewuathe] Modify documentations
6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
Note: not relevant for Python API since it only has a static train method
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4969 from jkbradley/SPARK-6252 and squashes the following commits:
a471d90 [Joseph K. Bradley] small edits from review
63eff48 [Joseph K. Bradley] Added getLambda to Scala NaiveBayes
This continues the work in #4460 from srowen . The design doc is published on the JIRA page with some minor changes.
Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
More details can be found in the design doc.
srowen Could you help review this PR? There are many lines but most of them are boilerplate code.
Author: Xiangrui Meng <meng@databricks.com>
Author: Sean Owen <sowen@cloudera.com>
Closes#4925 from mengxr/SPARK-4588-new and squashes the following commits:
71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
b1aceef [Xiangrui Meng] more tests
e7ab467 [Xiangrui Meng] update ML attribute impl
7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
jira: https://issues.apache.org/jira/browse/SPARK-6268
KMeans has many setters for parameters. It should have matching getters.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#4974 from hhbyyh/get4Kmeans and squashes the following commits:
f44d4dc [Yuhao Yang] add experimental to getRuns
f94a3d7 [Yuhao Yang] add get for KMeans
The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
Author: Xiangrui Meng <meng@databricks.com>
Closes#4699 from mengxr/SPARK-5814 and squashes the following commits:
48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Sean Owen <sowen@cloudera.com>
Closes#4950 from srowen/SPARK-6225 and squashes the following commits:
3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#4951 from yinxusen/SPARK-5986 and squashes the following commits:
6dd74a0 [Xusen Yin] rewrite some functions and classes
cd390fd [Xusen Yin] add indexed point
b144216 [Xusen Yin] remove invalid comments
dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that returns `RDD[(Double, Double)]`. Is it easy to register serializer for `Product` in PySpark?
Author: Xiangrui Meng <meng@databricks.com>
Closes#4863 from mengxr/SPARK-6090 and squashes the following commits:
009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Author: Sean Owen <sowen@cloudera.com>
Closes#4912 from srowen/SPARK-6182.1 and squashes the following commits:
eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
LBFGS and OWLQN in Breeze 0.10 has convergence check bug.
This is fixed in 0.11, see the description in Breeze project for detail:
https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>
Author: DB Tsai <dbtsai@dbtsai.com>
Closes#4879 from dbtsai/breeze and squashes the following commits:
d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
35c2f26 [Xiangrui Meng] fix LRSuite
397a208 [DB Tsai] upgrade breeze
Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell.
This prints a warning.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4864 from jkbradley/dt-save-heap and squashes the following commits:
02e8daf [Joseph K. Bradley] fixed based on code review
7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5950
Thanks viirya for the initial work in #4729.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4826 from yhuai/insertNullabilityCheck and squashes the following commits:
3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4811 from mengxr/SPARK-5991 and squashes the following commits:
f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
Remove unicode characters from MLlib file.
Author: Michael Griffiths <msjgriffiths@gmail.com>
Author: Griffiths, Michael (NYC-RPM) <michael.griffiths@reprisemedia.com>
Closes#4815 from msjgriffiths/SPARK-6063 and squashes the following commits:
bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
Since the validation error does not change monotonically, in practice, it should be proper to pick the best model when training GradientBoostedTrees with validation instead of stopping it early.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4763 from viirya/gbt_record_model and squashes the following commits:
452e049 [Liang-Chi Hsieh] Address comment.
ea2fae2 [Liang-Chi Hsieh] Pick the best model when training GradientBoostedTrees with validation.
The model trained by ALS requires partitioning information to do quick lookup of a user/item factor for making recommendation on individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause performance regression.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#4748 from mengxr/SPARK-5976 and squashes the following commits:
9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS
260f183 [Xiangrui Meng] add a test for partitioner
One can early stop if the decrease in error rate is lesser than a certain tol or if the error increases if the training data is overfit.
This introduces a new method runWithValidation which takes in a pair of RDD's , one for the training data and the other for the validation.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4677 from MechCoder/spark-5436 and squashes the following commits:
1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`. It doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.
CC: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4695 from mengxr/SPARK-5900 and squashes the following commits:
865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
Another one from JoshRosen 's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all the change is standardizing class and method names.
Author: Sean Owen <sowen@cloudera.com>
Closes#4514 from srowen/SPARK-4682 and squashes the following commits:
5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema be public instead of private to ml. This would be nice to include in Spark 1.3
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4682 from jkbradley/SPARK-5902 and squashes the following commits:
6f02357 [Joseph K. Bradley] Made transformSchema public
0e6d0a0 [Joseph K. Bradley] made implementations of transformSchema protected as well
fdaf26a [Joseph K. Bradley] Made PipelineStage.transformSchema protected instead of private[ml]
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
For unordered features, it is sufficient to use splits since the threshold of the split corresponds the threshold of the HighSplit of the bin and there is no use of the LowSplit.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4231 from MechCoder/spark-3381 and squashes the following commits:
58c19a5 [MechCoder] COSMIT
c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint
b2b9b89 [MechCoder] COSMIT
d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
`numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes performance regression, especially in Python.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4647 from mengxr/SPARK-5858 and squashes the following commits:
036dc7f [Xiangrui Meng] remove unnecessary first() call
12c5548 [Xiangrui Meng] check numFeatures only once
If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#4593 from mengxr/SPARK-5802 and squashes the following commits:
ae3be84 [Xiangrui Meng] cache transformed data in glm
On a big dataset explicitly unpersist train and validation folds allows to load more data into memory in the next loop iteration. On my environment (single node 8Gb worker RAM, 2 GB dataset file, 3 folds for cross validation), saved more than 5 minutes.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes#4595 from petro-rudenko/patch-2 and squashes the following commits:
66a7cfb [Peter Rudenko] Move validationDataset cache to declaration
c5f3265 [Peter Rudenko] [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop
If it's a last estimator in Pipeline there's no need to transform data, since there's no next stage that would consume this data.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes#4590 from petro-rudenko/patch-1 and squashes the following commits:
d13ec33 [Peter Rudenko] [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline
This PR adds three groups to the ScalaDoc: `param`, `setParam`, and `getParam`. Params will show up in the generated Scala API doc as the top group. Setters/getters will be at the bottom.
Preview:
![screen shot 2015-02-13 at 2 47 49 pm](https://cloud.githubusercontent.com/assets/829644/6196657/5740c240-b38f-11e4-94bb-bd8ef5a796c5.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes#4600 from mengxr/SPARK-5730 and squashes the following commits:
febed9a [Xiangrui Meng] add doc groups to spark.ml components
because ArrayBuffer is not specialized.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4594 from mengxr/SPARK-5803 and squashes the following commits:
1261bd5 [Xiangrui Meng] merge master
a4ea872 [Xiangrui Meng] use ArrayBuilder to build primitive arrays
This PR detaches MLlib model import/export code from SQL's JSON support, and hence unblocks #4544 . yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes#4555 from mengxr/SPARK-5757 and squashes the following commits:
b0415e8 [Xiangrui Meng] replace SQL JSON usage by json4s
The `initialState` of LDA performs several RDD operations that looks redundant. This pr tries to simplify these operations.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4501 from viirya/sim_lda and squashes the following commits:
4870fe4 [Liang-Chi Hsieh] For comments.
9af1487 [Liang-Chi Hsieh] Refactor initial step of LDA to remove redundant operations.
Do not recursively strip out projects. Only strip the first level project.
```scala
df("colA") + df("colB").as("colC")
```
Previously, the above would construct an invalid plan.
Author: Reynold Xin <rxin@databricks.com>
Closes#4519 from rxin/computability and squashes the following commits:
87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.
Author: Davies Liu <davies@databricks.com>
Closes#4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
Following discussion in the Jira.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4459 from MechCoder/sparse_gmm and squashes the following commits:
1b18dab [MechCoder] Rewrite syr for sparse matrices
e579041 [MechCoder] Add test for covariance matrix
5cb370b [MechCoder] Separate tests for sparse data
5e096bd [MechCoder] Alphabetize and correct error message
e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input
This is based on #4444 from jkbradley with the following changes:
1. Node schema updated to
~~~
treeId: int
nodeId: Int
predict/
|- predict: Double
|- prob: Double
impurity: Double
isLeaf: Boolean
split/
|- feature: Int
|- threshold: Double
|- featureType: Int
|- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
~~~
2. Some refactor of the implementation.
Closes#4444.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4493 from mengxr/SPARK-5597 and squashes the following commits:
75e3bb6 [Xiangrui Meng] fix style
2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
45873a2 [Joseph K. Bradley] org imports
1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles
Fix ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, not sure if that's to be treated as an MLlib or Breeze issue.
Author: Sean Owen <sowen@cloudera.com>
Closes#4485 from srowen/SPARK-4900 and squashes the following commits:
7355aa1 [Sean Owen] Fix ARPACK error code mapping
Author: Sandy Ryza <sandy@cloudera.com>
Closes#1093 from sryza/sandy-spark-2149 and squashes the following commits:
5f06b33 [Sandy Ryza] More review comments
0f73060 [Sandy Ryza] Respond to Sean's review comments
0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation
Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.
Author: Sean Owen <sowen@cloudera.com>
Closes#4461 from srowen/SPARK-4405 and squashes the following commits:
c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods
This is #4447 with `override`.
Closes#4447
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#4462 from mengxr/SPARK-5660 and squashes the following commits:
f82c8d6 [Xiangrui Meng] add override to matrix.apply
91cedde [Joseph K. Bradley] made matrix apply public
following #4233. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4422 from mengxr/SPARK-5598 and squashes the following commits:
a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM
...ceed max int.
Large values of k and/or n in EigenValueDecomposition.symmetricEigs will result in array initialization to a value larger than Integer.MAX_VALUE in the following: var v = new Array[Double](n * ncv)
Author: mbittmann <mbittmann@gmail.com>
Author: bittmannm <mark.bittmann@agilex.com>
Closes#4433 from mbittmann/master and squashes the following commits:
ee56e05 [mbittmann] [SPARK-5656] Combine checks into simple message
e49cbbb [mbittmann] [SPARK-5656] Simply error message
860836b [mbittmann] Array size check updates based on code review
a604816 [bittmannm] [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int.
Overload `trainOn`, `predictOn`, and `predictOnValues`.
CC freeman-lab
Author: Xiangrui Meng <meng@databricks.com>
Closes#4432 from mengxr/streaming-java and squashes the following commits:
6a79b85 [Xiangrui Meng] add java test for streaming logistic regression
2d7b357 [Xiangrui Meng] organize imports
1f662b3 [Xiangrui Meng] make streaming linear algorithms Java-friendly