Commit graph

1049 commits

Author SHA1 Message Date
Joseph K. Bradley d2d5e7fe2d [SPARK-9704] [ML] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
Made ProbabilisticClassifier, Identifiable, VectorUDT public.  All are annotated as DeveloperApi.

CC: mengxr EronWright

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits:

7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
2015-08-12 20:43:36 -07:00
Xiangrui Meng fc1c7fd66e [SPARK-9915] [ML] stopWords should use StringArrayParam
hhbyyh

Author: Xiangrui Meng <meng@databricks.com>

Closes #8141 from mengxr/SPARK-9915.
2015-08-12 17:06:12 -07:00
Xiangrui Meng e6aef55766 [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType
hhbyyh

Author: Xiangrui Meng <meng@databricks.com>

Closes #8140 from mengxr/SPARK-9912.
2015-08-12 17:04:31 -07:00
Holden Karau 6e409bc135 [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params
As per the TODO move weightCol to Shared Params.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
2015-08-12 16:54:45 -07:00
Xiangrui Meng caa14d9dc9 [SPARK-9913] [MLLIB] LDAUtils should be private
feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8142 from mengxr/SPARK-9913.
2015-08-12 16:53:47 -07:00
Joseph K. Bradley 551def5d69 [SPARK-9789] [ML] Added logreg threshold param back
Reinstated LogisticRegression.threshold Param for binary compatibility.  Param thresholds overrides threshold, if set.

CC: mengxr dbtsai feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8079 from jkbradley/logreg-reinstate-threshold.
2015-08-12 14:27:13 -07:00
Joseph K. Bradley 70fe558867 [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.
This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8115 from jkbradley/copyvalues-fix.
2015-08-12 10:48:52 -07:00
Andrew Or 736af95bd0 [HOTFIX] Fix style error caused by 017b5de 2015-08-11 14:52:52 -07:00
Sudhakar Thota 017b5de07e [SPARK-8925] [MLLIB] Add @since tags to mllib.util
Went thru the history of changes the file MLUtils.scala and picked up the version that the change went in.

Author: Sudhakar Thota <sudhakarthota@yahoo.com>
Author: Sudhakar Thota <sudhakarthota@sudhakars-mbp-2.usca.ibm.com>

Closes #7436 from sthota2014/SPARK-8925_thotas.
2015-08-11 14:31:51 -07:00
Feynman Liang be3e271641 [SPARK-9788] [MLLIB] Fix LDA Binary Compatibility
1. Add “asymmetricDocConcentration” and revert docConcentration changes. If the (internal) doc concentration vector is a single value, “getDocConcentration" returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise.
2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors.

jkbradley

Author: Feynman Liang <fliang@databricks.com>

Closes #8077 from feynmanliang/SPARK-9788 and squashes the following commits:

6b07bc8 [Feynman Liang] Code review changes
9d6a71e [Feynman Liang] Add asymmetricAlpha alias
bf4e685 [Feynman Liang] Asymmetric docConcentration
4cab972 [Feynman Liang] Default gammaShape
2015-08-11 14:21:53 -07:00
Feynman Liang 520ad44b17 [SPARK-9750] [MLLIB] Improve equals on SparseMatrix and DenseMatrix
Adds unit test for `equals` on `mllib.linalg.Matrix` class and `equals` to both `SparseMatrix` and `DenseMatrix`. Supports equality testing between `SparseMatrix` and `DenseMatrix`.

mengxr

Author: Feynman Liang <fliang@databricks.com>

Closes #8042 from feynmanliang/SPARK-9750 and squashes the following commits:

bb70d5e [Feynman Liang] Breeze compare for dense matrices as well, in case other is sparse
ab6f3c8 [Feynman Liang] Sparse matrix compare for equals
22782df [Feynman Liang] Add equality based on matrix semantics, not representation
78f9426 [Feynman Liang] Add casts
43d28fa [Feynman Liang] Fix failing test
6416fa0 [Feynman Liang] Add failing sparse matrix equals tests
2015-08-11 12:49:47 -07:00
Holden Karau dbd778d84d [SPARK-8764] [ML] string indexer should take option to handle unseen values
As a precursor to adding a public constructor add an option to handle unseen values by skipping rather than throwing an exception (default remains throwing an exception),

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits:

38a4de9 [Holden Karau] fix long line
045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure
81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive
7f37f6e [Holden Karau] remove extra space (scala style)
414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid
1e53f9b [Holden Karau] update the param (codegen side)
7a22215 [Holden Karau] fix typo
100a39b [Holden Karau] Merge in master
aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true
75ffa69 [Holden Karau] Remove extra newline
d69ef5e [Holden Karau] Add a test
b5734be [Holden Karau] Add support for unseen labels
afecd4e [Holden Karau] Add a param to skip invalid entries.
2015-08-11 11:33:36 -07:00
Yanbo Liang 8cad854ef6 [SPARK-8345] [ML] Add an SQL node as a feature transformer
Implements the transforms which are defined by SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__'
where '__THIS__' represents the underlying table of the input dataset.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7465 from yanboliang/spark-8345 and squashes the following commits:

b403fcb [Yanbo Liang] address comments
0d4bb15 [Yanbo Liang] a better transformSchema() implementation
51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
2015-08-11 11:01:59 -07:00
Feynman Liang 00b655cced [SPARK-9755] [MLLIB] Add docs to MultivariateOnlineSummarizer methods
Adds method documentations back to `MultivariateOnlineSummarizer`, which were present in 1.4 but disappeared somewhere along the way to 1.5.

jkbradley

Author: Feynman Liang <fliang@databricks.com>

Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits:

af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
2015-08-10 11:01:45 -07:00
Feynman Liang 85be65b39c [SPARK-9719] [ML] Clean up Naive Bayes doc
Small documentation cleanups, including:
 * Adds documentation for `pi` and `theta`
 * setParam to `setModelType`

Author: Feynman Liang <fliang@databricks.com>

Closes #8047 from feynmanliang/SPARK-9719 and squashes the following commits:

b372438 [Feynman Liang] Clean up naive bayes doc
2015-08-07 17:21:12 -07:00
Feynman Liang cd540c1e59 [SPARK-9756] [ML] Make constructors in ML decision trees private
These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors.

jkbradley

Author: Feynman Liang <fliang@databricks.com>

Closes #8046 from feynmanliang/SPARK-9756 and squashes the following commits:

2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private
a06f596 [Feynman Liang] Make constructors in ML decision trees private
2015-08-07 17:19:48 -07:00
Bertrand Dechoux 902334fd55 [SPARK-9748] [MLLIB] Centriod typo in KMeansModel
A minor typo (centriod -> centroid). Readable variable names help every users.

Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>

Closes #8037 from BertrandDechoux/kmeans-typo and squashes the following commits:

47632fe [Bertrand Dechoux] centriod typo
2015-08-07 16:07:24 -07:00
Dariusz Kobylarz e2fbbe7311 [SPARK-8481] [MLLIB] GaussianMixtureModel predict accepting single vector
Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs

CC: dkobylarz  mengxr

To be merged with master and branch-1.5
Primary author: dkobylarz

Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>

Closes #8039 from jkbradley/gmm-predict-vec and squashes the following commits:

bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector
2015-08-07 14:51:03 -07:00
Xiangrui Meng 54c0789a05 [SPARK-9493] [ML] add featureIndex to handle vector features in IsotonicRegression
This PR contains the following changes:
* add `featureIndex` to handle vector features (in order to chain isotonic regression easily with output from logistic regression
* make getter/setter names consistent with params
* remove inheritance from Regressor because it is tricky to handle both `DoubleType` and `VectorType`
* simplify test data generation

jkbradley zapletal-martin

Author: Xiangrui Meng <meng@databricks.com>

Closes #7952 from mengxr/SPARK-9493 and squashes the following commits:

8818ac3 [Xiangrui Meng] address comments
05e2216 [Xiangrui Meng] address comments
8d08090 [Xiangrui Meng] add featureIndex to handle vector features make getter/setter names consistent with params remove inheritance from Regressor
2015-08-06 13:29:31 -07:00
MechCoder 076ec05681 [SPARK-9533] [PYSPARK] [ML] Add missing methods in Word2Vec ML
After https://github.com/apache/spark/pull/7263 it is pretty straightforward to Python wrappers.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7930 from MechCoder/spark-9533 and squashes the following commits:

1bea394 [MechCoder] make getVectors a lazy val
5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
2015-08-06 10:09:58 -07:00
MechCoder c5c6aded64 [SPARK-9112] [ML] Implement Stats for LogisticRegression
I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary

I have some queries and asked them inline.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7538 from MechCoder/log_reg_stats and squashes the following commits:

2e9f7c7 [MechCoder] Change defs into lazy vals
d775371 [MechCoder] Clean up class inheritance
9586125 [MechCoder] Add abstraction to handle Multiclass Metrics
40ad8ef [MechCoder] minor
640376a [MechCoder] remove unnecessary dataframe stuff and add docs
80d9954 [MechCoder] Added tests
fbed861 [MechCoder] DataFrame support for metrics
70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression
2015-08-06 10:08:33 -07:00
Xusen Yin a018b85716 [SPARK-5895] [ML] Add VectorSlicer - updated
Add VectorSlicer transformer to spark.ml, with features specified as either indices or names.  Transfers feature attributes for selected features.

Updated version of [https://github.com/apache/spark/pull/5731]

CC: yinxusen This updates your PR.  You'll still be the primary author of this PR.

CC: mengxr

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits:

b16e86e [Joseph K. Bradley] fixed scala style
71c65d2 [Joseph K. Bradley] fix import order
86e9739 [Joseph K. Bradley] cleanups per code review
9d8d6f1 [Joseph K. Bradley] style fix
83bc2e9 [Joseph K. Bradley] Updated VectorSlicer
98c6939 [Xusen Yin] fix style error
ecbf2d3 [Xusen Yin] change interfaces and params
f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895
e4781f2 [Xusen Yin] fix commit error
fd154d7 [Xusen Yin] add test suite of vector slicer
17171f8 [Xusen Yin] fix slicer
9ab9747 [Xusen Yin] add vector slicer
aa5a0bf [Xusen Yin] add vector slicer
2015-08-05 17:07:55 -07:00
Feynman Liang dac090d1e9 [SPARK-9657] Fix return type of getMaxPatternLength
mengxr

Author: Feynman Liang <fliang@databricks.com>

Closes #7974 from feynmanliang/SPARK-9657 and squashes the following commits:

7ca533f [Feynman Liang] Fix return type of getMaxPatternLength
2015-08-05 15:42:18 -07:00
Mike Dusenberry 34dcf10104 [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
mengxr This adds the `BlockMatrix` to PySpark.  I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:

27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
c014002 [Mike Dusenberry] Using properties for better documentation.
3bda6ab [Mike Dusenberry] Adding documentation.
8fb3095 [Mike Dusenberry] Small cleanup.
e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
2015-08-05 07:40:50 -07:00
Xiangrui Meng a02bcf20c4 [SPARK-9540] [MLLIB] optimize PrefixSpan implementation
This is a major refactoring of the PrefixSpan implementation. It contains the following changes:

1. Expand prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issue when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicates partial prefix items, e.g., `_2` is represented by `-2`.
3. Remember the start indices of all partial projections in the projected postfix to help next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than its content.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
8. Tests are not changed except using `0` instead of `-1` as the delimiter.

`Postfix`'s API doc should be a good place to start.

Closes #7594

feynmanliang zhangjiajin

Author: Xiangrui Meng <meng@databricks.com>

Closes #7937 from mengxr/SPARK-9540 and squashes the following commits:

2d0ec31 [Xiangrui Meng] address more comments
48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test
65f90e8 [Xiangrui Meng] naming and documentation
8afc86a [Xiangrui Meng] refactor impl
2015-08-04 22:28:49 -07:00
Holden Karau d92fa14179 [SPARK-8601] [ML] Add an option to disable standardization for linear regression
All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same.

In R, there is an option for this.
standardize

Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".

Note that the primary author for this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>

Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:

e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparision operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl
2015-08-04 18:15:26 -07:00
Feynman Liang 629e26f7ee [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy
jkbradley

Author: Feynman Liang <fliang@databricks.com>

Closes #7941 from feynmanliang/SPARK-9609-stategy-spelling and squashes the following commits:

d2aafb1 [Feynman Liang] Add deprecated backwards compatibility
aa090a8 [Feynman Liang] Fix spelling
2015-08-04 18:13:18 -07:00
Joseph K. Bradley b77d3b9688 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol.

I don't think setScoreCol was actually used anywhere (based on search).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits:

e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol
2015-08-04 16:52:43 -07:00
Mike Dusenberry 571d5b5363 [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark.  Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object.  New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class.  This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code.  Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity.  Associated documentation and unit-tests have also been added.  To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:

bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner.  Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that.  If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly.  This is only for internal usage, and publicly, we still require 'rows' to be an RDD.  We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed.  The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance.  This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the  and  parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static  methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors.  Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4').  The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output.  This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices.  Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier.  The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.  This way, we can call  for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object.  This is analogous to the behavior of PySpark RDDs and DataFrames.  We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices factory methods to accept numRows and numCols with default values.  Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices.  Added a factory method for creating a RowMatrix from an RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method.  Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
2015-08-04 16:30:03 -07:00
Joseph K. Bradley 1833d9c08f [SPARK-9582] [ML] LDA cleanups
Small cleanups to recent LDA additions and docs.

CC: feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7916 from jkbradley/lda-cleanups and squashes the following commits:

f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning
97947aa [Joseph K. Bradley] a few more cleanups
5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood
c566915 [Joseph K. Bradley] small edit to make review easier
63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
2015-08-04 15:43:13 -07:00
Holden Karau 5a23213c14 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
This PR replaces the old "threshold" with a generalized "thresholds" Param.  We keep getThreshold,setThreshold for backwards compatibility for binary classification.

Note that the primary author of this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:

3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
2015-08-04 10:12:22 -07:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
MechCoder 13675c742a [SPARK-8874] [ML] Add missing methods in Word2Vec
Add missing methods

1. getVectors
2. findSynonyms

to W2Vec scala and python API

mengxr

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits:

149d5ca [MechCoder] minor doc
69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec
2015-08-03 16:44:25 -07:00
Xiangrui Meng e4765a4683 [SPARK-9544] [MLLIB] add Python API for RFormula
Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:

3d5ff03 [Xiangrui Meng] add an doctest for . and -
5e969a5 [Xiangrui Meng] fix pydoc
1cd41f8 [Xiangrui Meng] organize imports
3c18b10 [Xiangrui Meng] add Python API for RFormula
2015-08-03 13:59:35 -07:00
Joseph K. Bradley ff9169a002 [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor
Added featureImportance to RandomForestClassifier and Regressor.

This follows the scikit-learn implementation here: [a95203b249/sklearn/tree/_tree.pyx (L3341)]

CC: yanboliang  Would you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Feynman Liang <fliang@databricks.com>

Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:

72a167a [Joseph K. Bradley] fixed unit test
86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
33df5db [Joseph K. Bradley] fix unit test
42a2d3b [Joseph K. Bradley] fix unit test
fe94e72 [Joseph K. Bradley] modified feature importance unit tests
cc693ee [Feynman Liang] Add classifier tests
79a6f87 [Feynman Liang] Compare dense vectors in test
21d01fc [Feynman Liang] Added failing SKLearn test
ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor.  Need to add unit tests
2015-08-03 12:17:46 -07:00
Joseph K. Bradley 69f5a7c934 [SPARK-9528] [ML] Changed RandomForestClassifier to extend ProbabilisticClassifier
RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction.

CC: holdenk

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7859 from jkbradley/rf-prob and squashes the following commits:

6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier
2015-08-03 10:46:34 -07:00
Xiangrui Meng 66924ffa6b [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly
1. Use `PrefixSpanModel` to wrap the frequent sequences.
2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence`
3. Overload `run` for Java users.
4. Added a unit test in Java to check Java compatibility.

zhangjiajin feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #7869 from mengxr/SPARK-9527 and squashes the following commits:

4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly
2015-08-02 11:50:17 -07:00
Feynman Liang 28d944e86d [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan
mengxr Please review after #7818 merges and master is rebased.

Continues work by rikima

Closes #7400

Author: Feynman Liang <fliang@databricks.com>
Author: masaki rikitoku <rikima3132@gmail.com>

Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits:

8b2c756 [Feynman Liang] Remove orig
92443c8 [Feynman Liang] Style fixes
42c6349 [Feynman Liang] Style fix
14e67fc [Feynman Liang] Generic prefixSpan itemtypes
b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api
b86e0d5 [masaki rikitoku] modify to support generic item type
2015-08-01 23:11:25 -07:00
Meihua Wu 84a6982b35 [SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate
Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.

cc jkbradley

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7858 from rotationsymmetry/SPARK-9530 and squashes the following commits:

b574923 [Meihua Wu] Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
2015-08-01 17:13:28 -07:00
Yuhao Yang 8765665015 [SPARK-8169] [ML] Add StopWordsRemover as a transformer
jira: https://issues.apache.org/jira/browse/SPARK-8169

stop words: http://en.wikipedia.org/wiki/Stop_words

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.

Currently I used a minimum stop words set since on some [case](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html), small set of stop words is preferred.
ASCII char has been tested, Yet I cannot check it in due to style check.

Further thought,
1. Maybe I should use OpenHashSet. Is it recommended?
2. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).
3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6742 from hhbyyh/stopwords and squashes the following commits:

fa959d8 [Yuhao Yang] separating udf
f190217 [Yuhao Yang] replace default list and other small fix
04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords
b3aa957 [Yuhao Yang] add stopWordsRemover
2015-08-01 02:31:28 -07:00
zhangjiajin d2a9b66f6c [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences
mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin

 * Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA

Closes #7646

Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>

Closes #7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits:

4ded81d [Feynman Liang] Replace all filters to filter nonempty
350e67e [Feynman Liang] Code review feedback
03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences
d1fe0ed [Feynman Liang] Remove comments
86ca4e5 [Feynman Liang] Fix style
7c7bf39 [Feynman Liang] Fixed itemSet sequences
6073b10 [Feynman Liang] Basic itemset functionality, failing test
1a7fb48 [Feynman Liang] Add delimiter to results
5db00aa [Feynman Liang] Working for items, not itemsets
6787716 [Feynman Liang] Working on temporal sequences
f1114b9 [Feynman Liang] Add -1 delimiter
00fe756 [Feynman Liang] Reset base files for rebase
f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items).
60a0b76 [zhangjiajin] fixed a scala style error.
740c203 [zhangjiajin] fixed a scala style error.
5785cb8 [zhangjiajin] support non-temporal sequence
a5d649d [zhangjiajin] restore original version
09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2
ae8c02d [zhangjiajin] Fixed some Scala style errors.
216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan
b572f54 [zhangjiajin] initialize file before rebase.
f06772f [zhangjiajin] fix a scala style error.
a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala
d9d8137 [zhang jiajin] Delete Prefixspan.scala
c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file.
2015-08-01 01:56:27 -07:00
Holden Karau 65038973a1 [SPARK-7446] [MLLIB] Add inverse transform for string indexer
It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #6339 from holdenk/SPARK-7446-inverse-transform-for-string-indexer and squashes the following commits:

7cdf915 [Holden Karau] scala style comment fix
b9cffb6 [Holden Karau] Update the labels param to have the metadata note
6a38edb [Holden Karau] Setting the default needs to come after the value gets defined
9e241d8 [Holden Karau] use Array.empty
21c8cfa [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
64dd3a3 [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
4f06c59 [Holden Karau] Fix comment styles, use empty array as the default, etc.
a60c0e3 [Holden Karau] CR feedback (remove old constructor, add a note about use of setLabels)
1987b95 [Holden Karau] Use default copy
71e8d66 [Holden Karau] Make labels a local param for StringIndexerInverse
8450d0b [Holden Karau] Use the labels param in StringIndexerInverse
7464019 [Holden Karau] Add a labels param
868b1a9 [Holden Karau] Update scaladoc since we don't have labelsCol anymore
5aa38bf [Holden Karau] Add an inverse test using only meta data, pass labels when calling inverse method
f3e0c64 [Holden Karau] CR feedback
ebed932 [Holden Karau] Add Experimental tag and some scaladocs. Also don't require that the inputCol has the metadata on it, instead have the labelsCol specified when creating the inverse.
03ebf95 [Holden Karau] Add explicit type for invert function
ecc65e0 [Holden Karau] Read the metadata correctly, use the array, pass the test
a42d773 [Holden Karau] Fix test to supply cols as per new invert method
16cc3c3 [Holden Karau] Add an invert method
d4bcb20 [Holden Karau] Make the inverse string indexer into a transformer (still needs test updates but compiles)
e8bf3ad [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
c3fdee1 [Holden Karau] Some WIP refactoring based on jkbradley's CR feedback. Definite work-in-progress
557bef8 [Holden Karau] Instead of using a private inverse transform, add an invert function so we can use it in a pipeline
88779c1 [Holden Karau] fix long line
78b28c1 [Holden Karau] Finish reverse part and add a test :)
bb16a6a [Holden Karau] Some progress
2015-08-01 01:09:38 -07:00
Wenchen Fan 1d59a4162b [SPARK-9480][SQL] add MapData and cleanup internal row stuff
This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`.

After that, we have specialized getters for all internal type, so I removed generic getter in `ArrayData` and added specialized `toArray` for it.
Also did some refactor and cleanup for `InternalRow` and its subclasses.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7799 from cloud-fan/map-data and squashes the following commits:

77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData
2015-08-01 00:17:15 -07:00
Feynman Liang f51fd6fbb4 [SPARK-8936] [MLLIB] OnlineLDA document-topic Dirichlet hyperparameter optimization
Adds `alpha` (document-topic Dirichlet parameter) hyperparameter optimization to `OnlineLDAOptimizer` following Huang: Maximum Likelihood Estimation of Dirichlet Distribution Parameters. Also introduces a private `setSampleWithReplacement` to `OnlineLDAOptimizer` for unit testing purposes.

Author: Feynman Liang <fliang@databricks.com>

Closes #7836 from feynmanliang/SPARK-8936-alpha-optimize and squashes the following commits:

4bef484 [Feynman Liang] Documentation improvements
c3c6c1d [Feynman Liang] Fix docs
151e859 [Feynman Liang] Fix style
fa77518 [Feynman Liang] Hyperparameter optimization
2015-07-31 18:36:22 -07:00
Yanbo Liang fbef566a10 [SPARK-9308] [ML] ml.NaiveBayesModel support predicting class probabilities
Make NaiveBayesModel support predicting class probabilities, inherit from ProbabilisticClassificationModel.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7672 from yanboliang/spark-9308 and squashes the following commits:

25e224c [Yanbo Liang] raw2probabilityInPlace should operate in-place
3ee56d6 [Yanbo Liang] change predictRaw and raw2probabilityInPlace
c07e7a2 [Yanbo Liang] ml.NaiveBayesModel support predicting class probabilities
2015-07-31 13:11:42 -07:00
Meihua Wu 3c0d2e5521 [SPARK-9246] [MLLIB] DistributedLDAModel predict top docs per topic
Add topDocumentsPerTopic to DistributedLDAModel.

Add ScalaDoc and unit tests.

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7769 from rotationsymmetry/SPARK-9246 and squashes the following commits:

1029e79c [Meihua Wu] clean up code comments
a023b82 [Meihua Wu] Update tests to use Long for doc index.
91e5998 [Meihua Wu] Use Long for doc index.
b9f70cf [Meihua Wu] Revise topDocumentsPerTopic
26ff3f6 [Meihua Wu] Add topDocumentsPerTopic, scala doc and unit tests
2015-07-31 13:01:10 -07:00
Feynman Liang a8340fa7df [SPARK-9481] Add logLikelihood to LocalLDAModel
jkbradley Exposes `bound` (variational log likelihood bound) through public API as `logLikelihood`. Also adds unit tests, some DRYing of `LDASuite`, and includes unit tests mentioned in #7760

Author: Feynman Liang <fliang@databricks.com>

Closes #7801 from feynmanliang/SPARK-9481-logLikelihood and squashes the following commits:

6d1b2c9 [Feynman Liang] Negate perplexity definition
5f62b20 [Feynman Liang] Add logLikelihood
2015-07-31 12:12:22 -07:00
Yanbo Liang e8bdcdeabb [SPARK-6885] [ML] decision tree support predict class probabilities
Decision tree support predict class probabilities.
Implement the prediction probabilities function referred the old DecisionTree API and the [sklean API](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L593).
I make the DecisionTreeClassificationModel inherit from ProbabilisticClassificationModel, make the predictRaw to return the raw counts vector and make raw2probabilityInPlace/predictProbability return the probabilities for each prediction.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7694 from yanboliang/spark-6885 and squashes the following commits:

08d5b7f [Yanbo Liang] fix ImpurityStats null parameters and raw2probabilityInPlace sum = 0 issue
2174278 [Yanbo Liang] solve merge conflicts
7e90ba8 [Yanbo Liang] fix typos
33ae183 [Yanbo Liang] fix annotation
ff043d3 [Yanbo Liang] raw2probabilityInPlace should operate in-place
c32d6ce [Yanbo Liang] optimize calculateImpurityStats function again
6167fb0 [Yanbo Liang] optimize calculateImpurityStats function
fbbe2ec [Yanbo Liang] eliminate duplicated struct and code
beb1634 [Yanbo Liang] try to eliminate impurityStats for each LearningNode
99e8943 [Yanbo Liang] code optimization
5ec3323 [Yanbo Liang] implement InformationGainAndImpurityStats
227c91b [Yanbo Liang] refactor LearningNode to store ImpurityCalculator
d746ffc [Yanbo Liang] decision tree support predict class probabilities
2015-07-31 11:56:52 -07:00
Yuhao Yang 4011a94715 [SPARK-9231] [MLLIB] DistributedLDAModel method for top topics per document
jira: https://issues.apache.org/jira/browse/SPARK-9231

Helper method in DistributedLDAModel of this form:
```
/**
 * For each document, return the top k weighted topics for that document.
 * return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
```

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7785 from hhbyyh/topTopicsPerdoc and squashes the following commits:

30ad153 [Yuhao Yang] small fix
fd24580 [Yuhao Yang] add topTopics per document to DistributedLDAModel
2015-07-31 11:50:15 -07:00
Alexander Ulanov 6add4eddb3 [SPARK-9471] [ML] Multilayer Perceptron
This pull request contains the following feature for ML:
   - Multilayer Perceptron classifier

This implementation is based on our initial pull request with bgreeven: https://github.com/apache/spark/pull/1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
   - Extensible interface, so it will be easy to implement new types of networks
     - Main building blocks are traits `Layer` and `LayerModel`. They are used for constructing layers of ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release in order to save path to improve them based on community feedback
     - Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
   - Speed and scalability: this implementation has to be comparable in terms of speed to the state of the art single node implementations.
     - The developed benchmark for large ANN shows that the proposed code is on par with C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark

   - DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
   - Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm

mengxr and dbtsai kindly agreed to perform code review.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>

Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:

4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
2015-07-31 11:23:30 -07:00
Yanbo Liang 69b62f76fc [SPARK-9214] [ML] [PySpark] support ml.NaiveBayes for Python
support ml.NaiveBayes for Python

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7568 from yanboliang/spark-9214 and squashes the following commits:

5ee3fd6 [Yanbo Liang] fix typos
3ecd046 [Yanbo Liang] fix typos
f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
180452a [Yanbo Liang] fix typos
7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
2015-07-30 23:03:48 -07:00
Ram Sriharsha 4e5919bfb4 [SPARK-7690] [ML] Multiclass classification Evaluator
Multiclass Classification Evaluator for ML Pipelines. F1 score, precision, recall, weighted precision and weighted recall are supported as available metrics.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #7475 from harsha2010/SPARK-7690 and squashes the following commits:

9bf4ec7 [Ram Sriharsha] fix indentation
3f09a85 [Ram Sriharsha] cleanup doc
16115ae [Ram Sriharsha] code review fixes
032d2a3 [Ram Sriharsha] fix test
eec9865 [Ram Sriharsha] Fix Python Indentation
1dbeffd [Ram Sriharsha] Merge branch 'master' into SPARK-7690
68cea85 [Ram Sriharsha] Merge branch 'master' into SPARK-7690
54c03de [Ram Sriharsha] [SPARK-7690][ml][WIP] Multiclass Evaluator for ML Pipeline
2015-07-30 23:02:11 -07:00
Sean Owen 65fa4181c3 [SPARK-9077] [MLLIB] Improve error message for decision trees when numExamples < maxCategoriesPerFeature
Improve error message when number of examples is less than arity of high-arity categorical feature

CC jkbradley is this about what you had in mind? I know it's a starter, but was on my list to close out in the short term.

Author: Sean Owen <sowen@cloudera.com>

Closes #7800 from srowen/SPARK-9077 and squashes the following commits:

b8f6cdb [Sean Owen] Improve error message when number of examples is less than arity of high-arity categorical feature
2015-07-30 17:26:18 -07:00
Eric Liang e7905a9395 [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula
Preview:

```
> summary(m)
            features coefficients
1        (Intercept)    1.6765001
2       Sepal_Length    0.3498801
3 Species.versicolor   -0.9833885
4  Species.virginica   -1.0075104

```

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7771 from ericl/summary and squashes the following commits:

ccd54c3 [Eric Liang] second pass
a5ca93b [Eric Liang] comments
2772111 [Eric Liang] clean up
70483ef [Eric Liang] fix test
7c247d4 [Eric Liang] Merge branch 'master' into summary
3c55024 [Eric Liang] working
8c539aa [Eric Liang] first pass
2015-07-30 16:15:43 -07:00
Joseph K. Bradley be7be6d4c7 [SPARK-6684] [MLLIB] [ML] Add checkpointing to GBTs
Add checkpointing to GradientBoostedTrees, GBTClassifier, GBTRegressor

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7804 from jkbradley/gbt-checkpoint3 and squashes the following commits:

3fbd7ba [Joseph K. Bradley] tiny fix
b3e160c [Joseph K. Bradley] unset checkpoint dir after test
9cc3a04 [Joseph K. Bradley] added checkpointing to GBTs
2015-07-30 16:04:23 -07:00
martinzapletal 7f7a319c4c [SPARK-8671] [ML] Added isotonic regression to the pipeline API.
Author: martinzapletal <zapletal-martin@email.cz>

Closes #7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:

8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b68efc0 [martinzapletal] Added tests for param validation.
07c12bd [martinzapletal] Comments and refactoring.
834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API
2015-07-30 15:57:14 -07:00
zsxwing 0dbd6963d5 [SPARK-9479] [STREAMING] [TESTS] Fix ReceiverTrackerSuite failure for maven build and other potential test failures in Streaming
See https://issues.apache.org/jira/browse/SPARK-9479 for the failure cause.

The PR includes the following changes:
1. Make ReceiverTrackerSuite create StreamingContext in the test body.
2. Fix places that don't stop StreamingContext. I verified no SparkContext was stopped in the shutdown hook locally after this fix.
3. Fix an issue that `ReceiverTracker.endpoint` may be null.
4. Make sure stopping SparkContext in non-main thread won't fail other tests.

Author: zsxwing <zsxwing@gmail.com>

Closes #7797 from zsxwing/fix-ReceiverTrackerSuite and squashes the following commits:

3a4bb98 [zsxwing] Fix another potential NPE
d7497df [zsxwing] Fix ReceiverTrackerSuite; make sure StreamingContext in tests is closed
2015-07-30 15:39:46 -07:00
Feynman Liang 89cda69ecd [SPARK-9454] Change LDASuite tests to use vector comparisons
jkbradley Changes the current hacky string-comparison for vector compares.

Author: Feynman Liang <fliang@databricks.com>

Closes #7775 from feynmanliang/SPARK-9454-ldasuite-vector-compare and squashes the following commits:

bd91a82 [Feynman Liang] Remove println
905c76e [Feynman Liang] Fix string compare in distributed EM
2f24c13 [Feynman Liang] Improve LDASuite tests
2015-07-30 14:08:59 -07:00
Feynman Liang d8cfd531c7 [SPARK-5567] [MLLIB] Add predict method to LocalLDAModel
jkbradley hhbyyh

Adds `topicDistributions` to LocalLDAModel. Please review after #7757 is merged.

Author: Feynman Liang <fliang@databricks.com>

Closes #7760 from feynmanliang/SPARK-5567-predict-in-LDA and squashes the following commits:

0ad1134 [Feynman Liang] Remove println
27b3877 [Feynman Liang] Code review fixes
6bfb87c [Feynman Liang] Remove extra newline
476f788 [Feynman Liang] Fix checks and doc for variationalInference
061780c [Feynman Liang] Code review cleanup
3be2947 [Feynman Liang] Rename topicDistribution -> topicDistributions
2a821a6 [Feynman Liang] Add predict methods to LocalLDAModel
2015-07-30 13:17:54 -07:00
Wenchen Fan c0cc0eaec6 [SPARK-9390][SQL] create a wrapper for array type
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7724 from cloud-fan/array-data and squashes the following commits:

d0408a1 [Wenchen Fan] fix python
661e608 [Wenchen Fan] rebase
f39256c [Wenchen Fan] fix hive...
6dbfa6f [Wenchen Fan] fix hive again...
8cb8842 [Wenchen Fan] remove element type parameter from getArray
43e9816 [Wenchen Fan] fix mllib
e719afc [Wenchen Fan] fix hive
4346290 [Wenchen Fan] address comment
d4a38da [Wenchen Fan] remove sizeInBytes and add license
7e283e2 [Wenchen Fan] create a wrapper for array type
2015-07-30 10:04:30 -07:00
Sean Owen ed3cb1d21c [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length
Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.

CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.

Author: Sean Owen <sowen@cloudera.com>

Closes #7794 from srowen/SPARK-9277 and squashes the following commits:

e8dc31e [Sean Owen] Fix scalastyle
6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
2015-07-30 09:19:55 -07:00
Meihua Wu a6e53a9c8b [SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents
Add unit tests for running LDA with empty documents.
Both EMLDAOptimizer and OnlineLDAOptimizer are tested.

feynmanliang

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:

3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
f9432e8 [Meihua Wu] Incorporate reviewer's comments
8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents
2015-07-30 08:52:01 -07:00
Yuhao Yang 9c0501c5d0 [SPARK-] [MLLIB] minor fix on tokenizer doc
A trivial fix for the comments of RegexTokenizer.

Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7791 from hhbyyh/docFix and squashes the following commits:

cdf2542 [Yuhao Yang] minor fix on tokenizer doc
2015-07-30 08:20:52 -07:00
zhangjiajin d212a31422 [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases
Continuation of work by zhangjiajin

Closes #7412

Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>

Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:

a61943d [Feynman Liang] Collect small patterns to local
4ddf479 [Feynman Liang] Parallelize freqItemCounts
ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
87fa021 [Feynman Liang] Improve extend prefix readability
c2caa5c [Feynman Liang] Readability improvements and comments
1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
da0091b [Feynman Liang] Use lists for prefixes to reuse data
cb2a4fc [Feynman Liang] Inline code for readability
01c9ae9 [Feynman Liang] Add getters
6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
64271b3 [zhangjiajin] Modified codes according to comments.
d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
095aa3a [zhangjiajin] Modified the code according to the review comments.
baa2885 [zhangjiajin] Modified the code according to the review comments.
6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan
a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
4dd1c8a [zhangjiajin] initialize file before rebase.
078d410 [zhangjiajin] fix a scala style error.
22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
2015-07-30 08:14:09 -07:00
Joseph K. Bradley c5815930be [SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs
PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs.  This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.

I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684].  It should be useful for other iterative algorithms as well.

Changes I made:
* Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
* Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
* The subclasses for Graphs and RDDs implement those abstract methods.
* I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing andrewor14 in case there are relevant changes to checkpointing.
CCing feynmanliang in case you're interested in learning about checkpointing.
CCing mengxr for final OK.
Thanks all!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:

d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.
2015-07-30 07:56:15 -07:00
Yuhao Yang d31c618e3c [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix
jira: https://issues.apache.org/jira/browse/SPARK-7368
Add QR decomposition for RowMatrix.

I'm not sure what's the blueprint about the distributed Matrix from community and whether this will be a desirable feature , so I sent a prototype for discussion. I'll go on polish the code and provide ut and performance statistics if it's acceptable.

The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf]
Austin R. Benson, David F. Gleich, James Demmel. "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures", 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability.

Currently I tried it on a 400000 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced)  to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement.

Any trial and suggestion is welcome.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits:

cec797b [Yuhao Yang] remove unnecessary qr
0fb1012 [Yuhao Yang] hierarchy R computing
3fbdb61 [Yuhao Yang] update qr to indirect and add ut
0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
39b0b22 [Yuhao Yang] initial draft for discussion
2015-07-30 07:49:10 -07:00
Feynman Liang a200e64561 [SPARK-9440] [MLLIB] Add hyperparameters to LocalLDAModel save/load
jkbradley MechCoder

Resolves blocking issue for SPARK-6793. Please review after #7705 is merged.

Author: Feynman Liang <fliang@databricks.com>

Closes #7757 from feynmanliang/SPARK-9940-localSaveLoad and squashes the following commits:

d0d8cf4 [Feynman Liang] Fix thisClassName
0f30109 [Feynman Liang] Fix tests after changing LDAModel public API
dc61981 [Feynman Liang] Add hyperparams to LocalLDAModel save/load
2015-07-29 19:02:15 -07:00
Holden Karau 37c2d1927c [SPARK-9016] [ML] make random forest classifiers implement classification trait
Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to providing thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:

bf22fa6 [Holden Karau] Add missing imports for testing suite
e948f0d [Holden Karau] Check the prediction generation from rawprediciton
25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
1a67e04 [Holden Karau] Use old decission tree stuff instead
673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
0d15b96 [Holden Karau] FIx typo
5eafad4 [Holden Karau] add a constructor for rootnode + num classes
fc6156f [Holden Karau] scala style fix
2597915 [Holden Karau] take num classes in constructor
3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
16aea1c [Holden Karau] Make tests match the new models
b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
77b4114 [Holden Karau] Import vectors lib
2015-07-29 18:18:29 -07:00
Bimal Tandel 103d8cce78 [SPARK-8921] [MLLIB] Add @since tags to mllib.stat
Author: Bimal Tandel <bimal@bimal-MBP.local>

Closes #7730 from BimalTandel/branch_spark_8921 and squashes the following commits:

3ea230a [Bimal Tandel] Spark 8921 add @since tags
2015-07-29 16:54:58 -07:00
Feynman Liang 2cc212d56a [SPARK-6793] [MLLIB] OnlineLDAOptimizer LDA perplexity
Implements `logPerplexity` in `OnlineLDAOptimizer`. Also refactors inference code into companion object to enable future reuse (e.g. `predict` method).

Author: Feynman Liang <fliang@databricks.com>

Closes #7705 from feynmanliang/SPARK-6793-perplexity and squashes the following commits:

6da2c99 [Feynman Liang] Remove get* from LDAModel public API
8381da6 [Feynman Liang] Code review comments
17f7000 [Feynman Liang] Documentation typo fixes
2f452a4 [Feynman Liang] Remove auxillary DistributedLDAModel constructor
a275914 [Feynman Liang] Prevent empty counts calls to variationalInference
06d02d9 [Feynman Liang] Remove deprecated LocalLDAModel constructor
afecb46 [Feynman Liang] Fix regression bug in sstats accumulator
5a327a0 [Feynman Liang] Code review quick fixes
998c03e [Feynman Liang] Fix style
1cbb67d [Feynman Liang] Fix access modifier bug
4362daa [Feynman Liang] Organize imports
4f171f7 [Feynman Liang] Fix indendation
2f049ce [Feynman Liang] Fix failing save/load tests
7415e96 [Feynman Liang] Pick changes from big PR
11e7c33 [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-6793-perplexity
f8adc48 [Feynman Liang] Add logPerplexity, refactor variationalBound into a method
cd521d6 [Feynman Liang] Refactor methods into companion class
7f62a55 [Feynman Liang] --amend
c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing
aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
2015-07-29 16:20:20 -07:00
MechCoder 198d181dfb [SPARK-7105] [PYSPARK] [MLLIB] Support model save/load in GMM
This PR introduces save / load for GMM's in python API.

Also I refactored `GaussianMixtureModel` and inherited it from `JavaModelWrapper` with model being `GaussianMixtureModelWrapper`, a wrapper which provides convenience methods to `GaussianMixtureModel` (due to serialization and deserialization issues) and I moved the creation of gaussians to the scala backend.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7617 from MechCoder/python_gmm_save_load and squashes the following commits:

9c305aa [MechCoder] [SPARK-7105] [PySpark] [MLlib] Support model save/load in GMM
2015-07-28 15:00:25 -07:00
Eric Liang 8d5bb5283c [SPARK-9391] [ML] Support minus, dot, and intercept operators in SparkR RFormula
Adds '.', '-', and intercept parsing to RFormula. Also splits RFormulaParser into a separate file.

Umbrella design doc here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7707 from ericl/string-features-2 and squashes the following commits:

8588625 [Eric Liang] exclude complex types for .
8106ffe [Eric Liang] comments
a9350bb [Eric Liang] s/var/val
9c50d4d [Eric Liang] Merge branch 'string-features' into string-features-2
581afb2 [Eric Liang] Merge branch 'master' into string-features
08ae539 [Eric Liang] Merge branch 'string-features' into string-features-2
f99131a [Eric Liang] comments
cecec43 [Eric Liang] Merge branch 'string-features' into string-features-2
0bf3c26 [Eric Liang] update docs
4592df2 [Eric Liang] intercept supports
7412a2e [Eric Liang] Fri Jul 24 14:56:51 PDT 2015
3cf848e [Eric Liang] fix the parser
0556c2b [Eric Liang] Merge branch 'string-features' into string-features-2
c302a2c [Eric Liang] fix tests
9d1ac82 [Eric Liang] Merge remote-tracking branch 'upstream/master' into string-features
e713da3 [Eric Liang] comments
cd231a9 [Eric Liang] Wed Jul 22 17:18:44 PDT 2015
4d79193 [Eric Liang] revert to seq + distinct
169a085 [Eric Liang] tweak functional test
a230a47 [Eric Liang] Merge branch 'master' into string-features
72bd6f3 [Eric Liang] fix merge
d841cec [Eric Liang] Merge branch 'master' into string-features
5b2c4a2 [Eric Liang] Mon Jul 20 18:45:33 PDT 2015
b01c7c5 [Eric Liang] add test
8a637db [Eric Liang] encoder wip
a1d03f4 [Eric Liang] refactor into estimator
2015-07-28 14:16:57 -07:00
vinodkc 4af622c855 [SPARK-8919] [DOCUMENTATION, MLLIB] Added @since tags to mllib.recommendation
Author: vinodkc <vinod.kc.in@gmail.com>

Closes #7325 from vinodkc/add_since_mllib.recommendation and squashes the following commits:

93156f2 [vinodkc] Changed 0.8.0 to 0.9.1
c413350 [vinodkc] Added @since
2015-07-28 08:48:57 -07:00
Eric Liang 8ddfa52c20 [SPARK-9230] [ML] Support StringType features in RFormula
This adds StringType feature support via OneHotEncoder. As part of this task it was necessary to change RFormula to an Estimator, so that factor levels could be determined from the training dataset.

Not sure if I am using uids correctly here, would be good to get reviewer help on that.
cc mengxr

Umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#

Author: Eric Liang <ekl@databricks.com>

Closes #7574 from ericl/string-features and squashes the following commits:

f99131a [Eric Liang] comments
0bf3c26 [Eric Liang] update docs
c302a2c [Eric Liang] fix tests
9d1ac82 [Eric Liang] Merge remote-tracking branch 'upstream/master' into string-features
e713da3 [Eric Liang] comments
4d79193 [Eric Liang] revert to seq + distinct
169a085 [Eric Liang] tweak functional test
a230a47 [Eric Liang] Merge branch 'master' into string-features
72bd6f3 [Eric Liang] fix merge
d841cec [Eric Liang] Merge branch 'master' into string-features
5b2c4a2 [Eric Liang] Mon Jul 20 18:45:33 PDT 2015
b01c7c5 [Eric Liang] add test
8a637db [Eric Liang] encoder wip
a1d03f4 [Eric Liang] refactor into estimator
2015-07-27 17:17:49 -07:00
George Dittmar 1f7b3d9dc7 [SPARK-7423] [MLLIB] Modify ClassificationModel and Probabalistic model to use Vector.argmax
Use Vector.argmax call instead of converting to dense vector before calculating predictions.

Author: George Dittmar <georgedittmar@gmail.com>

Closes #7670 from GeorgeDittmar/sprk-7423 and squashes the following commits:

e796747 [George Dittmar] Changing ClassificationModel and ProbabilisticClassificationModel to use Vector.argmax instead of converting to DenseVector
2015-07-27 11:16:33 -07:00
Yuhao Yang b79bf1df62 [SPARK-9337] [MLLIB] Add an ut for Word2Vec to verify the empty vocabulary check
jira: https://issues.apache.org/jira/browse/SPARK-9337

Word2Vec should throw exception when vocabulary is empty

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7660 from hhbyyh/ut4Word2vec and squashes the following commits:

17a18cb [Yuhao Yang] add ut for word2vec
2015-07-26 14:02:20 +01:00
Reynold Xin 4a01bfc2a2 [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.

Author: Reynold Xin <rxin@databricks.com>

Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:

ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
2015-07-25 23:52:37 -07:00
MechCoder a400ab516f [SPARK-7045] [MLLIB] Avoid intermediate representation when creating model
Word2Vec used to convert from an Array[Float] representation to a Map[String, Array[Float]] and then back to an Array[Float] through Word2VecModel.

This prevents this conversion while still supporting the older method of supplying a Map.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5748 from MechCoder/spark-7045 and squashes the following commits:

e308913 [MechCoder] move docs
5703116 [MechCoder] minor
fa04313 [MechCoder] style fixes
b1d61c4 [MechCoder] better errors and tests
3b32c8c [MechCoder] [SPARK-7045] Avoid intermediate representation when creating model
2015-07-24 14:58:07 -07:00
MechCoder e253124513 [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
This makes it easier to test all the class variables of the DistributedLDAmodel.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7573 from MechCoder/lda_test and squashes the following commits:

2f1a293 [MechCoder] [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
2015-07-24 10:56:48 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Ram Sriharsha d4d762f275 [SPARK-8092] [ML] Allow OneVsRest Classifier feature and label column names to be configurable.
The base classifier input and output columns are ignored in favor of  the ones specified in OneVsRest.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6631 from harsha2010/SPARK-8092 and squashes the following commits:

6591dc6 [Ram Sriharsha] add documentation for params
b7024b1 [Ram Sriharsha] cleanup
f0e2bfb [Ram Sriharsha] merge with master
108d3d7 [Ram Sriharsha] merge with master
4f74126 [Ram Sriharsha] Allow label/ features columns to be configurable
2015-07-23 22:35:41 -07:00
Davies Liu 8a94eb23d5 [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType
Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases).

In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18).

If no precision and scale is provide, it's Decimal(10, 0) as before.

Author: Davies Liu <davies@databricks.com>

Closes #7605 from davies/decimal_unlimited and squashes the following commits:

aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
2015-07-23 18:31:13 -07:00
Liang-Chi Hsieh 825ab1e452 [SPARK-7254] [MLLIB] Run PowerIterationClustering directly on graph
JIRA: https://issues.apache.org/jira/browse/SPARK-7254

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6054 from viirya/pic_on_graph and squashes the following commits:

8b87b81 [Liang-Chi Hsieh] Fix scala style.
a22fb8b [Liang-Chi Hsieh] For comment.
ef565a0 [Liang-Chi Hsieh] Fix indentation.
d249aa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into pic_on_graph
82d7351 [Liang-Chi Hsieh] Run PowerIterationClustering directly on graph.
2015-07-22 23:29:26 -07:00
Joseph K. Bradley 410dd41cf6 [SPARK-9268] [ML] Removed varargs annotation from Params.setDefault taking multiple params
Removed varargs annotation from Params.setDefault taking multiple params.

Though varargs is technically correct, it often requires that developers do clean assembly, rather than (not clean) assembly, which is a nuisance during development.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7604 from jkbradley/params-setdefault-varargs and squashes the following commits:

6016dc6 [Joseph K. Bradley] removed varargs annotation from Params.setDefault taking multiple params
2015-07-22 23:27:25 -07:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
Reynold Xin d71a13f475 [SPARK-9262][build] Treat Scala compiler warnings as errors
I've seen a few cases in the past few weeks that the compiler is throwing warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).

Most of the work are done by ericl.

Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes #7598 from rxin/warnings and squashes the following commits:

beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
2015-07-22 21:02:19 -07:00
martinzapletal a721ee5270 [SPARK-8484] [ML] Added TrainValidationSplit for hyper-parameter tuning.
- [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- [X] Simplified replacement of https://github.com/apache/spark/pull/6996

Author: martinzapletal <zapletal-martin@email.cz>

Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:

cafc949 [martinzapletal] Review comments https://github.com/apache/spark/pull/7337.
511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to https://github.com/apache/spark/pull/7337
00c4f5a [martinzapletal] SPARK-8484. Styling.
d699506 [martinzapletal] SPARK-8484. Styling.
93ed2ee [martinzapletal] Styling.
3bc1853 [martinzapletal] SPARK-8484. Styling.
2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
2015-07-22 17:35:05 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults at 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never sent a status that large. We increase it to 128.

- spark.executor.memory: Defaults at 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
Feynman Liang 1aca9c13c1 [SPARK-8536] [MLLIB] Generalize OnlineLDAOptimizer to asymmetric document-topic Dirichlet priors
Modify `LDA` to take asymmetric document-topic prior distributions and `OnlineLDAOptimizer` to use the asymmetric prior during variational inference.

This PR only generalizes `OnlineLDAOptimizer` and the associated `LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support symmetric `alpha` (checked during `EMLDAOptimizer.initialize`).

Author: Feynman Liang <fliang@databricks.com>

Closes #7575 from feynmanliang/SPARK-8536-LDA-asymmetric-priors and squashes the following commits:

af8fbb7 [Feynman Liang] Fix merge errors
ef5821d [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-8536-LDA-asymmetric-priors
58f1d7b [Feynman Liang] Fix from review feedback
a6dcf70 [Feynman Liang] Change docConcentration interface and move LDAOptimizer validation to initialize, add sad path tests
72038ff [Feynman Liang] Add tests referenced against gensim
d4284fa [Feynman Liang] Generalize OnlineLDA to asymmetric priors, no tests
2015-07-22 15:07:05 -07:00
Feynman Liang 8486cd8531 [SPARK-9224] [MLLIB] OnlineLDA Performance Improvements
In-place updates, reduce number of transposes, and vectorize operations in OnlineLDA implementation.

Author: Feynman Liang <fliang@databricks.com>

Closes #7454 from feynmanliang/OnlineLDA-perf-improvements and squashes the following commits:

78b0f5a [Feynman Liang] Make in-place variables vals, fix BLAS error
7f62a55 [Feynman Liang] --amend
c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing
aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
2015-07-22 13:06:01 -07:00
MechCoder 89db3c0b6e [SPARK-5989] [MLLIB] Model save/load for LDA
Add support for saving and loading LDA both the local and distributed versions.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6948 from MechCoder/lda_save_load and squashes the following commits:

49bcdce [MechCoder] minor style fixes
cc14054 [MechCoder] minor
4587d1d [MechCoder] Minor changes
c753122 [MechCoder] Load and save the model in private methods
2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
2015-07-21 10:31:31 -07:00
petz2000 df4ddb3120 [SPARK-8915] [DOCUMENTATION, MLLIB] Added @since tags to mllib.classification
Created since tags for methods in mllib.classification

Author: petz2000 <petz2000@gmail.com>

Closes #7371 from petz2000/add_since_mllib.classification and squashes the following commits:

39fe291 [petz2000] Removed whitespace in block comment
c9b1e03 [petz2000] Removed @since tags again from protected and private methods
cd759b6 [petz2000] Added @since tags to methods
2015-07-21 08:50:43 -07:00
Holden Karau 4d97be9530 [SPARK-9204][ML] Add default params test for linearyregression suite
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits:

630ba19 [Holden Karau] style fix
faa08a3 [Holden Karau] Add default params test for linearyregression suite
2015-07-20 22:15:10 -07:00
Eric Liang 1cbdd89918 [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula
This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.

The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7483 from ericl/spark-8774 and squashes the following commits:

3dfac0c [Eric Liang] update
17ef516 [Eric Liang] more comments
1753a0f [Eric Liang] make glm generic
b0f50f8 [Eric Liang] equivalence test
550d56d [Eric Liang] export methods
c015697 [Eric Liang] second pass
117949a [Eric Liang] comments
5afbc67 [Eric Liang] test label columns
6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
d33211b [Eric Liang] r support
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-20 20:49:38 -07:00
Meihua Wu ff3c72dbaf [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1
Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1
Also include unit tests to verify the fix.

mengxr brkyvz

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:

fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
2015-07-20 17:03:46 -07:00
MechCoder d0b4e93f7e [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test
Python API for the KS-test

Statistics.kolmogorovSmirnovTest(data, distName, *params)
I'm not quite sure how to support the callable function since it is not serializable.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7430 from MechCoder/spark-8996 and squashes the following commits:

2dd009d [MechCoder] minor
021d233 [MechCoder] Remove one wrapper and other minor stuff
49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
2015-07-20 09:00:01 -07:00
George Dittmar 3f7de7db4c [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423.

Author: George Dittmar <georgedittmar@gmail.com>
Author: George <dittmar@Georges-MacBook-Pro.local>
Author: dittmarg <george.dittmar@webtrends.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:

3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
127dec5 [Xiangrui Meng] update argmax impl
2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
5fd9380 [George Dittmar] fixing style check error
42341fb [George Dittmar] refactoring arg max check to better handle zero values
b22af46 [George Dittmar] Fixing spaces between commas in unit test
f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
aa330e3 [George Dittmar] Fixing some last if else spacing issues
ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
f21dcce [George Dittmar] commit
af17981 [dittmarg] Initial work fixing bug that was made clear in pr
eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
df9538a [George] Added argmax to sparse vector and added unit test
3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
04677af [George] initial work on adding argmax to Vector and SparseVector
2015-07-20 08:55:37 -07:00
Rekha Joshi 1017908205 [SPARK-9118] [ML] Implement IntArrayParam in mllib
Implement IntArrayParam in mllib

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits:

d3b1766 [Joshi] Implement IntArrayParam
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-07-17 20:02:05 -07:00
Yu ISHIKAWA 34a889db85 [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines
I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.

[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:

be752de [Yu ISHIKAWA] Add assertions
a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
fb2417c [Yu ISHIKAWA] Use getInt, instead of get
f397be4 [Yu ISHIKAWA] Switch the comparisons.
ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
f8338bc [Yu ISHIKAWA] Add the placeholders in Python
4a03003 [Yu ISHIKAWA] Test for contains in Python
6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
2ec80bc [Yu ISHIKAWA] Fit on 1 line
e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
4d2ad1e [Yu ISHIKAWA] Modify the indentations
0ae422f [Yu ISHIKAWA] Add a test for `setParams`
4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
5bedc51 [Yu ISHIKAWA] Remve an extra new line
444c289 [Yu ISHIKAWA] Add the validation for `runs`
e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
7991e15 [Yu ISHIKAWA] Add a validation for `k`
c2df35d [Yu ISHIKAWA] Make `predict` private
93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
11c2a12 [Yu ISHIKAWA] Limit the imports
badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
2015-07-17 18:30:04 -07:00
Bryan Cutler 8b8be1f5d6 [SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles
Broadcast of ensemble models in transformImpl before call to predict

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits:

86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf
40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127
9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction
1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline
171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model
6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models
aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform
83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier
2015-07-17 14:10:16 -07:00
Feynman Liang 6da1069696 [SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError
Make the definition of residuals in Spark consistent with literature. We have been using `prediction - label` for residuals, but literature usually defines `residual = label - prediction`.

Author: Feynman Liang <fliang@databricks.com>

Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits:

f4b39d8 [Feynman Liang] Fix doc
bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals
63f0d60 [Feynman Liang] Fix definition of residual
2015-07-17 14:00:53 -07:00
Yanbo Liang 9974642870 [SPARK-8600] [ML] Naive Bayes API for spark.ml Pipelines
Naive Bayes API for spark.ml Pipelines

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7284 from yanboliang/spark-8600 and squashes the following commits:

bc890f7 [Yanbo Liang] remove labels valid check
c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel
a2b3088 [Yanbo Liang] address comments
3220b82 [Yanbo Liang] trigger jenkins
3018a41 [Yanbo Liang] address comments
208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines
2015-07-17 13:55:17 -07:00
Yuhao Yang 806c579f43 [SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true)
jira: https://issues.apache.org/jira/browse/SPARK-9062

Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default.

I'm not sure what's the recommended way for Tokenizer to handle the null value in the input. Any suggestion will be welcome.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7414 from hhbyyh/tokenizer and squashes the following commits:

c01bd7a [Yuhao Yang] change output type of tokenizer
2015-07-17 13:43:19 -07:00
Yanbo Liang 441e072a22 [MINOR] [ML] fix wrong annotation of RFormula.formula
fix wrong annotation of RFormula.formula

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7470 from yanboliang/RFormula and squashes the following commits:

61f1919 [Yanbo Liang] fix wrong annotation
2015-07-17 09:00:41 -07:00
Xiangrui Meng 358e7bf652 [SPARK-9126] [MLLIB] do not assert on time taken by Thread.sleep()
Measure lower and upper bounds for task time and use them for validation. This PR also implements `Stopwatch.toString`. This suite should finish in less than 1 second.

jkbradley pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #7457 from mengxr/SPARK-9126 and squashes the following commits:

4b40faa [Xiangrui Meng] simplify tests
739f5bd [Xiangrui Meng] do not assert on time taken by Thread.sleep()
2015-07-16 23:02:06 -07:00
Joseph K. Bradley 322d286bb7 [SPARK-7131] [ML] Copy Decision Tree, Random Forest impl to spark.ml
This PR copies the RandomForest implementation from spark.mllib to spark.ml.  Note that this includes the DecisionTree implementation, but not the GradientBoostedTrees one (which will come later).

I essentially copied a minimal amount of code to spark.ml, removed the use of bins (and only used splits), and modified code only as much as necessary to get it to compile.  The spark.ml implementation still uses some spark.mllib classes (privately), which can be moved in future PRs.

This refactoring will be helpful in extending the node representation to include more information, such as class probabilities.

Specifically:
* Copied code from spark.mllib to spark.ml:
  * mllib.tree.DecisionTree, mllib.tree.RandomForest copied to ml.tree.impl.RandomForest (main implementation)
  * NodeIdCache (needed to use splits instead of bins)
  * TreePoint (use splits instead of bins)
* Added ml.tree.LearningNode used in RandomForest training (needed vars)
* Removed bins from implementation, and only used splits
* Small fix in JavaDecisionTreeRegressorSuite

CC: mengxr  manishamde  codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7294 from jkbradley/dt-move-impl and squashes the following commits:

48749be [Joseph K. Bradley] cleanups based on code review, mostly style
bea9703 [Joseph K. Bradley] scala style fixes.  added some scala doc
4e6d2a4 [Joseph K. Bradley] removed unnecessary use of copyValues, setParent for trees
9a4d721 [Joseph K. Bradley] cleanups. removed InfoGainStats from ml, using old one for now.
836e7d4 [Joseph K. Bradley] Fixed test suite failures
bd5e063 [Joseph K. Bradley] fixed bucketizing issue
0df3759 [Joseph K. Bradley] Need to remove use of Bucketizer
d5224a9 [Joseph K. Bradley] modified tree and forest to use moved impl
cc01823 [Joseph K. Bradley] still editing RF to get it to work
19143fb [Joseph K. Bradley] More progress, but not done yet.  Rebased with master after 1.4 release.
2015-07-16 22:26:59 -07:00
Xiangrui Meng 73d92b00b9 [SPARK-9018] [MLLIB] add stopwatches
Add stopwatches for easy instrumentation of MLlib algorithms. This is based on the `TimeTracker` used in decision trees. The distributed version uses Spark accumulator. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #7415 from mengxr/SPARK-9018 and squashes the following commits:

40b4347 [Xiangrui Meng] == -> ===
c477745 [Xiangrui Meng] address Joseph's comments
f981a49 [Xiangrui Meng] add stopwatches
2015-07-15 21:02:42 -07:00
Eric Liang 6960a7938c [SPARK-8774] [ML] Add R model formula with basic support as a transformer
This implements minimal R formula support as a feature transformer. Both numeric and string labels are supported, but features must be numeric for now.

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7381 from ericl/spark-8774-1 and squashes the following commits:

d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-15 20:33:06 -07:00
Feynman Liang 536533cad8 [SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVariance
Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005).

Author: Feynman Liang <fliang@databricks.com>

Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits:

f1112fc [Feynman Liang] Add explainedVariance formula
1a3d098 [Feynman Liang] SROwen code review comments
08a0e1b [Feynman Liang] Fix pyspark tests
db8605a [Feynman Liang] Style fix
bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased
c235de0 [Feynman Liang] Fix RegressionMetrics tests
4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
2015-07-15 13:32:25 -07:00
Feynman Liang 1bb8accbc9 [SPARK-8997] [MLLIB] Performance improvements in LocalPrefixSpan
Improves the performance of LocalPrefixSpan by implementing optimizations proposed in [SPARK-8997](https://issues.apache.org/jira/browse/SPARK-8997)

Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #7360 from feynmanliang/SPARK-8997-improve-prefixspan and squashes the following commits:

59db2f5 [Feynman Liang] Merge pull request #1 from mengxr/SPARK-8997
91e4357 [Xiangrui Meng] update LocalPrefixSpan impl
9212256 [Feynman Liang] MengXR code review comments
f055d82 [Feynman Liang] Fix failing scalatest
2e00cba [Feynman Liang] Depth first projections
70b93e3 [Feynman Liang] Performance improvements in LocalPrefixSpan, fix tests
2015-07-14 23:50:57 -07:00
FlytxtRnD 3f6296fed4 [SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as param
This allows Kmeans to be initialized using an existing set of cluster centers provided as  a KMeansModel object. This mode of initialization performs a single run.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:

94b56df [FlytxtRnD] style correction
ef95ee2 [FlytxtRnD] style correction
c446c58 [FlytxtRnD] documentation and numRuns warning change
06d13ef [FlytxtRnD] numRuns corrected
d12336e [FlytxtRnD] numRuns variable modifications
07f8554 [FlytxtRnD] remove setRuns from setIntialModel
e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
242ead1 [FlytxtRnD] corrected == to === in assert
714acb5 [FlytxtRnD] added numRuns
60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
3f5fc8e [FlytxtRnD] test case modified and one runs condition added
cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
2015-07-14 23:29:02 -07:00
Yu ISHIKAWA 4692769655 [SPARK-6259] [MLLIB] Python API for LDA
I implemented the Python API for LDA. But I didn't implemented a method for `LDAModel.describeTopics()`, beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue.

TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:

6855f59 [Yu ISHIKAWA] LDA inherits object
28bd165 [Yu ISHIKAWA] Change the place of testing code
d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
9f8bed8 [Yu ISHIKAWA] Simplify casting
faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
73412c3 [Yu ISHIKAWA] Fix the typo
2278829 [Yu ISHIKAWA] Fix the indentation
39514ec [Yu ISHIKAWA] Modify how to cast the input data
8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
2015-07-14 23:27:42 -07:00
Sean Owen 740b034f1c [SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel
Add predictProbabilities to Naive Bayes, return class probabilities.

Continues https://github.com/apache/spark/pull/6761

Author: Sean Owen <sowen@cloudera.com>

Closes #7376 from srowen/SPARK-4362 and squashes the following commits:

23d5a76 [Sean Owen] Fix model.labels -> model.theta
95d91fb [Sean Owen] Check that predicted probabilities sum to 1
b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities
2015-07-14 22:44:54 +01:00
Vinod K C 714fc55f4a [SPARK-8991] [ML] Update SharedParamsCodeGen's Generated Documentation
Removed private[ml] from Generated documentation

Author: Vinod K C <vinod.kc@huawei.com>

Closes #7367 from vinodkc/fix_sharedparmascodegen and squashes the following commits:

4fa3c8f [Vinod K C] Adding auto generated code
7e19025 [Vinod K C] Removed private[ml]
2015-07-13 12:03:39 -07:00
Joseph K. Bradley 0c5207c66d [SPARK-8994] [ML] tiny cleanups to Params, Pipeline
Made default impl of Params.validateParams empty
CC mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits:

4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
2015-07-10 21:25:09 -07:00
zhangjiajin 7f6be1f24d [SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib
Add parallel PrefixSpan algorithm and test file.
Support non-temporal sequences.

Author: zhangjiajin <zhangjiajin@huawei.com>
Author: zhang jiajin <zhangjiajin@huawei.com>

Closes #7258 from zhangjiajin/master and squashes the following commits:

ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
2015-07-10 21:11:46 -07:00
jose.cambronero 9c5075775d [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
This contribution is my original work and I license it to the project under it's open source license.

Author: jose.cambronero <jose.cambronero@cloudera.com>

Closes #6994 from josepablocam/master and squashes the following commits:

bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test
2015-07-10 20:55:45 -07:00
rahulpalamuttam 0772026c2f [SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm
Author: rahulpalamuttam <rahulpalamut@gmail.com>

Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits:

bef2843 [rahulpalamuttam] fix @since tags in mmlib.fpm
cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
2015-07-10 16:07:31 -07:00
Jonathan Alter e14b545d2d [SPARK-7977] [BUILD] Disallowing println
Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:

ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
2015-07-10 11:34:01 +01:00
Holden Karau 2727304660 [SPARK-8913] [ML] Simplify LogisticRegression suite to use Vector Vector comparision
Cleanup tests from SPARK 8700.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7335 from holdenk/SPARK-8913-cleanup-tests-from-SPARK-8700-logistic-regression-r2-really-logistic-regression-this-time and squashes the following commits:

e5e2c5f [Holden Karau] Simplify LogisticRegression suite to use Vector <-> Vector comparisions instead of comparing element by element
2015-07-09 19:08:33 -07:00
Feynman Liang a0cc3e5aa3 [SPARK-8538] [SPARK-8539] [ML] Linear Regression Training and Testing Results
Adds results (e.g. objective value at each iteration, residuals) on training and user-specified test sets for LinearRegressionModel.

Notes to Reviewers:
 * Are the `*TrainingResults` and `Results` classes too specialized for `LinearRegressionModel`? Where would be an appropriate level of abstraction?
 * Please check `transient` annotations are correct; the datasets should not be copied and kept during serialization.
 * Any thoughts on `RDD`s versus `DataFrame`s? If using `DataFrame`s, suggested schemas for each intermediate step? Also, how to create a "local DataFrame" without a `sqlContext`?

Author: Feynman Liang <fliang@databricks.com>

Closes #7099 from feynmanliang/SPARK-8538 and squashes the following commits:

d219fa4 [Feynman Liang] Update docs
4a42680 [Feynman Liang] Change Summary to hold values, move transient annotations down to metrics and predictions DF
6300031 [Feynman Liang] Code review changes
0a5e762 [Feynman Liang] Fix build error
e71102d [Feynman Liang] Merge branch 'master' into SPARK-8538
3367489 [Feynman Liang] Merge branch 'master' into SPARK-8538
70f267c [Feynman Liang] Make TrainingSummary transient and remove Serializable from *Summary and RegressionMetrics
1d9ea42 [Feynman Liang] Fix failing Java test
a65dfda [Feynman Liang] Make TrainingSummary and metrics serializable, prediction dataframe transient
0a605d8 [Feynman Liang] Replace Params from LinearRegression*Summary with private constructor vals
c2fe835 [Feynman Liang] Optimize imports
02d8a70 [Feynman Liang] Add Params to LinearModel*Summary, refactor tests and add test for evaluate()
8f999f4 [Feynman Liang] Refactor from jkbradley code review
072e948 [Feynman Liang] Style
509ae36 [Feynman Liang] Use DFs and localize serialization to LinearRegressionModel
9509c79 [Feynman Liang] Fix imports
b2bbaa3 [Feynman Liang] Refactored LinearRegressionResults API to be more private
ffceaec [Feynman Liang] Merge branch 'master' into SPARK-8538
1cedb2b [Feynman Liang] Add test for decreasing objective trace
dab0aff [Feynman Liang] Add LinearRegressionTrainingResults tests, make test suite code copy+pasteable
97b0a81 [Feynman Liang] Add LinearRegressionModel.evaluate() to get results on test sets
dc51bce [Feynman Liang] Style guide fixes
521f397 [Feynman Liang] Use RDD[(Double, Double)] instead of DF
2ff5710 [Feynman Liang] Add training results and model summary to ML LinearRegression
2015-07-09 16:21:21 -07:00
Holden Karau e29ce319fa [SPARK-8963][ML] cleanup tests in linear regression suite
Simplify model weight assertions to use vector comparision, switch to using absTol when comparing with 0.0 intercepts

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7327 from holdenk/SPARK-8913-cleanup-tests-from-SPARK-8700-logistic-regression and squashes the following commits:

5bac185 [Holden Karau] Simplify model weight assertions to use vector comparision, switch to using absTol when comparing with 0.0 intercepts
2015-07-09 15:49:30 -07:00
Yuhao Yang 0cd84c86ca [SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector
jira: https://issues.apache.org/jira/browse/SPARK-8703

Converts a text document to a sparse vector of token counts.

I can further add an estimator to extract vocabulary from corpus if that's appropriate.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7084 from hhbyyh/countVectorization and squashes the following commits:

5f3f655 [Yuhao Yang] text change
24728e4 [Yuhao Yang] style improvement
576728a [Yuhao Yang] rename to model and some fix
1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
99b0c14 [Yuhao Yang] undo extension from HashingTF
12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
7ee1c31 [Yuhao Yang] extends HashingTF
809fb59 [Yuhao Yang] minor fix for ut
7c61fb3 [Yuhao Yang] add countVectorizer
2015-07-09 10:26:38 -07:00
Davies Liu 74d8d3d928 [SPARK-8450] [SQL] [PYSARK] cleanup type converter for Python DataFrame
This PR fixes the converter for Python DataFrame, especially for DecimalType

Closes #7106

Author: Davies Liu <davies@databricks.com>

Closes #7131 from davies/decimal_python and squashes the following commits:

4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7d73168 [Davies Liu] fix conflit
6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7104e97 [Davies Liu] improve type infer
9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
829a05b [Davies Liu] fix UDT in python
c99e8c5 [Davies Liu] fix mima
c46814a [Davies Liu] convert decimal for Python DataFrames
2015-07-08 18:22:53 -07:00
Feynman Liang f472b8cdc0 [SPARK-5016] [MLLIB] Distribute GMM mixture components to executors
Distribute expensive portions of computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR#4654.

Notes for reviewers:
 * What should be the policy for when to distribute computation. Always? When numClusters > threshold? User-specified param?

TODO:
 * Performance testing and comparison for large number of clusters

Author: Feynman Liang <fliang@databricks.com>

Closes #7166 from feynmanliang/GMM_parallel_mixtures and squashes the following commits:

4f351fa [Feynman Liang] Update heuristic and scaladoc
5ea947e [Feynman Liang] Fix parallelization logic
00eb7db [Feynman Liang] Add helper method for GMM's M step, remove distributeGaussians flag
e7c8127 [Feynman Liang] Add distributeGaussians flag and tests
1da3c7f [Feynman Liang] Distribute mixtures
2015-07-08 16:32:00 -07:00
Feynman Liang 8c32b2e870 [SPARK-8877] [MLLIB] Public API for association rule generation
Adds FPGrowth.generateAssociationRules to public API for generating association rules after mining frequent itemsets.

Author: Feynman Liang <fliang@databricks.com>

Closes #7271 from feynmanliang/SPARK-8877 and squashes the following commits:

83b8baf [Feynman Liang] Add API Doc
867abff [Feynman Liang] Add FPGrowth.generateAssociationRules and change access modifiers for AssociationRules
2015-07-08 16:27:11 -07:00
DB Tsai 57221934e0 [SPARK-8700][ML] Disable feature scaling in Logistic Regression
All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same.

In R, there is an option for this.
`standardize`
Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".

+cc holdenk mengxr jkbradley

Author: DB Tsai <dbt@netflix.com>

Closes #7080 from dbtsai/lors and squashes the following commits:

877e6c7 [DB Tsai] repahse the doc
7cf45f2 [DB Tsai] address feedback
78d75c9 [DB Tsai] small change
c2c9e60 [DB Tsai] style
6e1a8e0 [DB Tsai] first commit
2015-07-08 15:21:58 -07:00
Kashif Rasul 3bb217750a [SPARK-8872] [MLLIB] added verification results from R for FPGrowthSuite
Author: Kashif Rasul <kashif.rasul@gmail.com>

Closes #7269 from kashif/SPARK-8872 and squashes the following commits:

2d5457f [Kashif Rasul] added R code for FP Int type
3de6808 [Kashif Rasul] added verification results from R for FPGrowthSuite
2015-07-08 08:44:58 -07:00
DB Tsai 3bf20c27ff [SPARK-8845] [ML] ML use of Breeze optimization: use adjustedValue instead of value
In LinearRegression and LogisticRegression, we use Breeze's optimizers (LBFGS and OWLQN). We check the State.value to see the current objective. However, Breeze's documentation makes it sound like value and adjustedValue differ for some optimizers, possibly including OWLQN: 26faf62286/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala (L36)
If that is the case, then we should use adjustedValue instead of value. This is relevant to SPARK-8538 and SPARK-8539, where we will provide the objective trace to the user.

Author: DB Tsai <dbt@netflix.com>

Closes #7245 from dbtsai/SPARK-8845 and squashes the following commits:

fa4c91e [DB Tsai] address feedback
e6caac1 [DB Tsai] java style multiline comment
b10c574 [DB Tsai] address feedback
c9ff81e [DB Tsai] first commit
2015-07-07 15:46:44 -07:00
MechCoder 35d781e71b [SPARK-8704] [ML] [PySpark] Add missing methods in StandardScaler
Add std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:

9fbae90 [MechCoder] Add type
6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
2015-07-07 12:35:40 -07:00
Feynman Liang 3336c7b148 [SPARK-8559] [MLLIB] Support Association Rule Generation
Distributed generation of single-consequent association rules from a RDD of frequent itemsets. Tests referenced against `R`'s implementation of A Priori in [arules](http://cran.r-project.org/web/packages/arules/index.html).

Author: Feynman Liang <fliang@databricks.com>

Closes #7005 from feynmanliang/fp-association-rules-distributed and squashes the following commits:

466ced0 [Feynman Liang] Refactor AR generation impl
73c1cff [Feynman Liang] Make rule attributes public, remove numTransactions from FreqItemset
80f63ff [Feynman Liang] Change default confidence and optimize imports
04cf5b5 [Feynman Liang] Code review with @mengxr, add R to tests
0cc1a6a [Feynman Liang] Java compatibility test
f3c14b5 [Feynman Liang] Fix MiMa test
764375e [Feynman Liang] Fix tests
1187307 [Feynman Liang] Almost working tests
b20779b [Feynman Liang] Working implementation
5395c4e [Feynman Liang] Fix imports
2d34405 [Feynman Liang] Partial implementation of distributed ar
83ace4b [Feynman Liang] Local rule generation without pruning complete
69c2c87 [Feynman Liang] Working local implementation, now to parallelize../..
4e1ec9a [Feynman Liang] Pull FreqItemsets out, refactor type param, tests
69ccedc [Feynman Liang] First implementation of association rule generation
2015-07-07 11:34:30 -07:00
MechCoder 1dbc4a155f [SPARK-8711] [ML] Add additional methods to PySpark ML tree models
Add numNodes and depth to treeModels, add treeWeights to ensemble Models.
Add __repr__ to all models.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7095 from MechCoder/missing_methods_tree and squashes the following commits:

23b08be [MechCoder] private [spark]
38a0860 [MechCoder] rename pyTreeWeights to javaTreeWeights
6d16ad8 [MechCoder] Fix Python 3 Error
47d7023 [MechCoder] Use np.allclose and treeEnsembleModel -> TreeEnsembleMethods
819098c [MechCoder] [SPARK-8711] [ML] Add additional methods ot PySpark ML tree models
2015-07-07 08:58:08 -07:00
Yanbo Liang d73bc08d98 [SPARK-8788] [ML] Add Java unit test for PCA transformer
Add Java unit test for PCA transformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7184 from yanboliang/spark-8788 and squashes the following commits:

9d1a2af [Yanbo Liang] address comments
b34451f [Yanbo Liang] Add Java unit test for PCA transformer
2015-07-07 08:19:17 -07:00
Alok Singh 6718c1eb67 [SPARK-5562] [MLLIB] LDA should handle empty document.
See the jira https://issues.apache.org/jira/browse/SPARK-5562

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>
Author: Alok Singh <“singhal@us.ibm.com”>

Closes #7064 from aloknsingh/aloknsingh_SPARK-5562 and squashes the following commits:

259a0a7 [Alok Singh] change as per the comments by @jkbradley
be48491 [Alok  Singh] [SPARK-5562][MLlib] re-order import in alphabhetical order
c01311b [Alok  Singh] [SPARK-5562][MLlib] fix the newline typo
b271c8a [Alok  Singh] [SPARK-5562][Mllib] As per github discussion with jkbradley. We would like to simply things.
7c06251 [Alok  Singh] [SPARK-5562][MLlib] modified the JavaLDASuite for test passing
c710cb6 [Alok  Singh] fix the scala code style to have space after :
2572a08 [Alok  Singh] [SPARK-5562][MLlib] change the import xyz._ to the import xyz.{c1, c2} ..
ab55fbf [Alok  Singh] [SPARK-5562][MLlib] Change as per Sean Owen's comments https://github.com/apache/spark/pull/7064/files#diff-9236d23975e6f5a5608ffc81dfd79146
9f4f9ea [Alok  Singh] [SPARK-5562][MLlib] LDA should handle empty document.
2015-07-06 21:53:55 -07:00
Xiangrui Meng 96c5eeec39 Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
This reverts commit 25f574eb9a. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits:

2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
2015-07-06 16:11:22 -07:00
Joshi f9c448dce8 [SPARK-7137] [ML] Update SchemaUtils checkInputColumn to print more info if needed
Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #5992 from rekhajoshm/fix/SPARK-7137 and squashes the following commits:

8c42b57 [Joshi] update checkInputColumn to print more info if needed
33ddd2e [Joshi] update checkInputColumn to print more info if needed
acf3e17 [Joshi] update checkInputColumn to print more info if needed
8993c0e [Joshi] SPARK-7137: Add checkInputColumn back to Params and print more info
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-07-05 12:58:03 -07:00
Yu ISHIKAWA 488bad319a [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:

975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
2015-07-02 15:55:16 -07:00
lewuathe 7d9cc9673e [SPARK-3382] [MLLIB] GradientDescent convergence tolerance
GrandientDescent can receive convergence tolerance value. Default value is 0.0.
When loss value becomes less than the tolerance which is set by user, iteration is terminated.

Author: lewuathe <lewuathe@me.com>

Closes #3636 from Lewuathe/gd-convergence-tolerance and squashes the following commits:

0b8a9a8 [lewuathe] Update doc
ce91b15 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
4f22c2b [lewuathe] Modify based on SPARK-1503
5e47b82 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
abadb7e [lewuathe] Fix LassoSuite
8fadebd [lewuathe] Fix failed unit tests
ee5de46 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
8313ba2 [lewuathe] Fix styles
0ead94c [lewuathe] Merge branch 'master' into gd-convergence-tolerance
a94cfd5 [lewuathe] Modify some styles
3aef0a2 [lewuathe] Modify converged logic to do relative comparison
f7b19d5 [lewuathe] [SPARK-3382] Clarify comparison logic
e6c9cd2 [lewuathe] [SPARK-3382] Compare with the diff of solution vector
4b125d2 [lewuathe] [SPARK3382] Fix scala style
e7c10dd [lewuathe] [SPARK-3382] format improvements
f867eea [lewuathe] [SPARK-3382] Modify warning message statements
b9d5e61 [lewuathe] [SPARK-3382] should compare diff inside loss history and convergence tolerance
5433f71 [lewuathe] [SPARK-3382] GradientDescent convergence tolerance
2015-07-02 15:00:13 -07:00
MechCoder 34d448dbe1 [SPARK-8479] [MLLIB] Add numNonzeros and numActives to linalg.Matrices
Matrices allow zeros to be stored in values. Sometimes a method is handy to check if the numNonZeros are same as number of Active values.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6904 from MechCoder/nnz_matrix and squashes the following commits:

252c6b7 [MechCoder] Add to MiMa excludes
e2390f5 [MechCoder] Use count instead of foreach
2f62b2f [MechCoder] Add to MiMa excludes
d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
2015-07-02 11:28:14 -07:00
Liang-Chi Hsieh 0e553a3e93 [SPARK-8708] [MLLIB] Paritition ALS ratings based on both users and products
JIRA: https://issues.apache.org/jira/browse/SPARK-8708

Previously the partitions of ratings are only based on the given products. So if the `usersProducts` given for prediction contains only few products or even one product, the generated ratings will be pushed into few or single partition and can't use high parallelism.

The following codes are the example reported in the JIRA. Because it asks the predictions for users on product 2. There is only one partition in the result.

    >>> r1 = (1, 1, 1.0)
    >>> r2 = (1, 2, 2.0)
    >>> r3 = (2, 1, 2.0)
    >>> r4 = (2, 2, 2.0)
    >>> r5 = (3, 1, 1.0)
    >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
    >>> users = ratings.map(itemgetter(0)).distinct()
    >>> model = ALS.trainImplicit(ratings, 1, seed=10)
    >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
    >>> predictions_for_2.glom().map(len).collect()
    [0, 0, 3, 0, 0]

This PR uses user and product instead of only product to partition the ratings.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7121 from viirya/mfm_fix_partition and squashes the following commits:

779946d [Liang-Chi Hsieh] Calculate approximate numbers of users and products in one pass.
4336dc2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into mfm_fix_partition
83e56c1 [Liang-Chi Hsieh] Instead of additional join, use the numbers of users and products to decide how to perform join.
b534dc8 [Liang-Chi Hsieh] Paritition ratings based on both users and products.
2015-07-02 10:18:23 -07:00
Alok Singh 99c40cd0d8 [SPARK-8647] [MLLIB] Potential issue with constant hashCode
I added the code,
  // see [SPARK-8647], this achieves the needed constant hash code without constant no.
  override def hashCode(): Int = this.getClass.getName.hashCode()

does getting the constant hash code as per jira

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>

Closes #7146 from aloknsingh/aloknsingh_SPARK-8647 and squashes the following commits:

e58bccf [Alok  Singh] [SPARK-8647][MLlib] to avoid the class derivation issues, change the constant hashCode to override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
43cdb89 [Alok  Singh] [SPARK-8647][MLlib] Potential issue with constant hashCode
2015-07-02 09:58:57 -07:00
Ilya Ganelin 3697232b7d [SPARK-3071] Increase default driver memory
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.

Please let me know if I've missed anything.

Will the spark-shell use the value within the command line builder during instantiation?

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:

4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
2015-07-01 23:11:02 -07:00
Rosstin 4e4f74b5e1 [SPARK-8660] [MLLIB] removed > symbols from comments in LogisticRegressionSuite.scala for ease of copypaste
'>' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste

also single-lined the multiline commands (is this desirable, or does it violate style?)

Author: Rosstin <asterazul@gmail.com>

Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits:

f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala
fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste
39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661
5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
2015-07-01 21:42:06 -07:00
lewuathe 184de91d15 [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils
Implement missing API in pyspark.

MLUtils
* appendBias
* loadVectors

`kFold` is also missing however I am not sure `ClassTag` can be passed or restored through python.

Author: lewuathe <lewuathe@me.com>

Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:

16863ea [lewuathe] Merge master
3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
6084e9c [lewuathe] Resolv conflict
d2aa2a0 [lewuathe] Resolv conflict
9c329d8 [lewuathe] Fix efficiency
3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
1d4714b [lewuathe] Fix style
b29e2bc [lewuathe] Remove scipy dependencies
e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
25d3c9d [lewuathe] Remove unnecessary imports
7ec04db [lewuathe] Resolv conflict
1502d13 [lewuathe] Resolv conflict
d6bd416 [lewuathe] Check existence of scipy.sparse
5d555b1 [lewuathe] Construct scipy.sparse matrix
c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
b8b5ef7 [lewuathe] Fix unnecessary sort method
d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
62a9c7e [lewuathe] Fix appendBias return type
454c73d [lewuathe] Merge branch 'master' into SPARK-6263
a353354 [lewuathe] Remove unnecessary appendBias implementation
44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
c728046 [lewuathe] Fix style
2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
2015-07-01 11:14:07 -07:00
Feynman Liang f457569886 [SPARK-8471] [ML] Rename DiscreteCosineTransformer to DCT
Rename DiscreteCosineTransformer and related classes to DCT.

Author: Feynman Liang <fliang@databricks.com>

Closes #7138 from feynmanliang/dct-features and squashes the following commits:

e547b3e [Feynman Liang] Fix renaming bug
9d5c9e4 [Feynman Liang] Lowercase JavaDCTSuite variable
f9a8958 [Feynman Liang] Remove old files
f8fe794 [Feynman Liang] Merge branch 'master' into dct-features
894d0b2 [Feynman Liang] Rename DiscreteCosineTransformer to DCT
433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
2015-06-30 20:19:43 -07:00
lee19 e72526227f [SPARK-8563] [MLLIB] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
I'm sorry that I made https://github.com/apache/spark/pull/6949 closed by mistake.
I pushed codes again.

And, I added a test code.

>
There is a bug that `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`
It should have been `U.numCols() = k = svd.U.numCols()`

>
```
self = U * sigma * V.transpose
(m x n) = (m x n) * (k x k) * (k x n) //ASIS
-->
(m x n) = (m x k) * (k x k) * (k x n) //TOBE
```

Author: lee19 <lee19@live.co.kr>

Closes #6953 from lee19/MLlibBugfix and squashes the following commits:

c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden.
4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error.
c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib]
8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
2015-06-30 14:08:00 -07:00
Joseph K. Bradley 3ba23ffd37 [SPARK-8736] [ML] GBTRegressor should not threshold prediction
Changed GBTRegressor so it does NOT threshold the prediction.  Added test which fails with bug but works after fix.

CC: feynmanliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7134 from jkbradley/gbrt-fix and squashes the following commits:

613b90e [Joseph K. Bradley] Changed GBTRegressor so it does NOT threshold the prediction
2015-06-30 14:02:50 -07:00
Yuhao Yang 61d7b533dd [SPARK-7514] [MLLIB] Add MinMaxScaler to feature transformation
jira: https://issues.apache.org/jira/browse/SPARK-7514
Add a popular scaling method to feature component, which is commonly known as min-max normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where `newBase` and `scale` are parameters (type Double) of the `VectorTransformer`. `newBase` is the new minimum number for the features, and `scale` controls the ranges after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application.

For case that `max == min`, 0.5 is used as the raw value. (0.5 * scale + newBase)
I'll add UT once the design got settled ( and this is not considered as too naive)

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6039 from hhbyyh/minMaxNorm and squashes the following commits:

f942e9f [Yuhao Yang] add todo for metadata
8b37bbc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
4894dbc [Yuhao Yang] add copy
fa2989f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
29db415 [Yuhao Yang] add clue and minor adjustment
5b8f7cc [Yuhao Yang] style fix
9b133d0 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
22f20f2 [Yuhao Yang] style change and bug fix
747c9bb [Yuhao Yang] add ut and remove mllib version
a5ba0aa [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
585cc07 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1c6dcb1 [Yuhao Yang] minor change
0f1bc80 [Yuhao Yang] add MinMaxScaler to ml
8e7436e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
3663165 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1247c27 [Yuhao Yang] some comments improvement
d285a19 [Yuhao Yang] initial checkin for minMaxNorm
2015-06-30 12:44:43 -07:00
Feynman Liang 74cc16dbc3 [SPARK-8471] [ML] Discrete Cosine Transform Feature Transformer
Implementation and tests for Discrete Cosine Transformer.

Author: Feynman Liang <fliang@databricks.com>

Closes #6894 from feynmanliang/dct-features and squashes the following commits:

433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
2015-06-30 12:31:33 -07:00
Yanbo Liang c1befd780c [SPARK-8664] [ML] Add PCA transformer
Add PCA transformer for ML pipeline

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7065 from yanboliang/spark-8664 and squashes the following commits:

4afae45 [Yanbo Liang] address comments
e9effd7 [Yanbo Liang] Add PCA transformer
2015-06-30 12:23:48 -07:00