Commit graph

471 commits

Author SHA1 Message Date
Travis Galoppo c4f0b4f334 SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an RDD only
Removed unnecessary parameters to predictMembership()

CC: jkbradley

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #3854 from tgaloppo/spark-5020 and squashes the following commits:

1bf4669 [Travis Galoppo] renamed predictMembership() to predictSoft()
0f1d96e [Travis Galoppo] SPARK-5020 - Removed superfluous parameters from predictMembership()
2014-12-31 15:39:58 -08:00
Sean Owen 3d194cc757 SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics
Now that I've implemented the basics here, I'm less convinced there is a need for this change, somehow. Callers can downsample before or after. Really the OOM is not in the ROC curve code, but in code that might `collect()` it for local analysis. Still, might be useful to down-sample since the ROC curve probably never needs millions of points.

This is a first pass. Since the `(score,label)` are already grouped and sorted, I think it's sufficient to just take every Nth such pair, in order to downsample by a factor of N? this is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve of course.

What do you think about the API, and usefulness?

Author: Sean Owen <sowen@cloudera.com>

Closes #3702 from srowen/SPARK-4547 and squashes the following commits:

1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc
692d825 [Sean Owen] Change handling of large numBins, make 2nd consturctor instead of optional param, style change
a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
2014-12-31 13:37:04 -08:00
Liang-Chi Hsieh 06a9aa589c [SPARK-4797] Replace breezeSquaredDistance
This PR replaces slow breezeSquaredDistance.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #3643 from viirya/faster_squareddistance and squashes the following commits:

f28b275 [Liang-Chi Hsieh] Move the implementation to linalg.Vectors and rename as sqdist.
0bc48ee [Liang-Chi Hsieh] Merge branch 'master' into faster_squareddistance
ba34422 [Liang-Chi Hsieh] Fix bug.
91849d0 [Liang-Chi Hsieh] Modified for comment.
44a65ad [Liang-Chi Hsieh] Modified for comments.
35db395 [Liang-Chi Hsieh] Fix bug and some modifications for comments.
f4f5ebb [Liang-Chi Hsieh] Follow BLAS.dot pattern to replace intersect, diff with while-loop.
a36e09f [Liang-Chi Hsieh] Use while-loop to replace foreach for better performance.
d3e0628 [Liang-Chi Hsieh] Make the methods private.
dd415bc [Liang-Chi Hsieh] Consider different cases of SparseVector and DenseVector.
13669db [Liang-Chi Hsieh] Replace breezeSquaredDistance.
2014-12-31 11:50:53 -08:00
Liu Jiongzhou 035bac88c7 [SPARK-4998][MLlib]delete the "train" function
To make the functions with the same in "object" effective, specially when using java reflection.
As the "train" function defined in "class DecisionTree" will hide the functions with the same name in "object DecisionTree".

JIRA[SPARK-4998]

Author: Liu Jiongzhou <ljzzju@163.com>

Closes #3836 from ljzzju/master and squashes the following commits:

4e13133 [Liu Jiongzhou] [MLlib]delete the "train" function
2014-12-30 15:55:56 -08:00
Jakub Dubovsky 0f31992c61 [Spark-4995] Replace Vector.toBreeze.activeIterator with foreachActive
New foreachActive method of vector was introduced by SPARK-4431 as more efficient alternative to vector.toBreeze.activeIterator. There are some parts of codebase where it was not yet replaced.

dbtsai

Author: Jakub Dubovsky <dubovsky@avast.com>

Closes #3846 from james64/SPARK-4995-foreachActive and squashes the following commits:

3eb7e37 [Jakub Dubovsky] Scalastyle fix
32fe6c6 [Jakub Dubovsky] activeIterator removed - IndexedRowMatrix.toBreeze
47a4777 [Jakub Dubovsky] activeIterator removed in RowMatrix.toBreeze
90a7d98 [Jakub Dubovsky] activeIterator removed in MLUtils.saveAsLibSVMFile
2014-12-30 14:19:07 -08:00
DB Tsai 040d6f2d13 [SPARK-4972][MLlib] Updated the scala doc for lasso and ridge regression for the change of LeastSquaresGradient
In #SPARK-4907, we added factor of 2 into the LeastSquaresGradient. We updated the scala doc for lasso and ridge regression here.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3808 from dbtsai/doc and squashes the following commits:

ec3c989 [DB Tsai] first commit
2014-12-29 17:17:12 -08:00
ganonp 343db392b5 Added setMinCount to Word2Vec.scala
Wanted to customize the private minCount variable in the Word2Vec class. Added
a method to do so.

Author: ganonp <ganonp@gmail.com>

Closes #3693 from ganonp/my-custom-spark and squashes the following commits:

ad534f2 [ganonp] made norm method public
5110a6f [ganonp] Reorganized
854958b [ganonp] Fixed Indentation for setMinCount
12ed8f9 [ganonp] Update Word2Vec.scala
76bdf5a [ganonp] Update Word2Vec.scala
ffb88bb [ganonp] Update Word2Vec.scala
5eb9100 [ganonp] Added setMinCount to Word2Vec.scala
2014-12-29 15:31:19 -08:00
Travis Galoppo 6cf6fdf3ff SPARK-4156 [MLLIB] EM algorithm for GMMs
Implementation of Expectation-Maximization for Gaussian Mixture Models.

This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.

Author: Travis Galoppo <tjg2107@columbia.edu>
Author: Travis Galoppo <travis@localhost.localdomain>
Author: tgaloppo <tjg2107@columbia.edu>
Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #3022 from tgaloppo/master and squashes the following commits:

aaa8f25 [Travis Galoppo] MLUtils: changed privacy of EPSILON from [util] to [mllib]
709e4bf [Travis Galoppo] fixed usage line to include optional maxIterations parameter
acf1fba [Travis Galoppo] Fixed parameter comment in GaussianMixtureModel Made maximum iterations an optional parameter to DenseGmmEM
9b2fc2a [Travis Galoppo] Style improvements Changed ExpectationSum to a private class
b97fe00 [Travis Galoppo] Minor fixes and tweaks.
1de73f3 [Travis Galoppo] Removed redundant array from array creation
578c2d1 [Travis Galoppo] Removed unused import
227ad66 [Travis Galoppo] Moved prediction methods into model class.
308c8ad [Travis Galoppo] Numerous changes to improve code
cff73e0 [Travis Galoppo] Replaced accumulators with RDD.aggregate
20ebca1 [Travis Galoppo] Removed unusued code
42b2142 [Travis Galoppo] Added functionality to allow setting of GMM starting point. Added two cluster test to testing suite.
8b633f3 [Travis Galoppo] Style issue
9be2534 [Travis Galoppo] Style issue
d695034 [Travis Galoppo] Fixed style issues
c3b8ce0 [Travis Galoppo] Merge branch 'master' of https://github.com/tgaloppo/spark   Adds predict() method
2df336b [Travis Galoppo] Fixed style issue
b99ecc4 [tgaloppo] Merge pull request #1 from FlytxtRnD/predictBranch
f407b4c [FlytxtRnD] Added predict() to return the cluster labels and membership values
97044cf [Travis Galoppo] Fixed style issues
dc9c742 [Travis Galoppo] Moved MultivariateGaussian utility class
e7d413b [Travis Galoppo] Moved multivariate Gaussian utility class to mllib/stat/impl Improved comments
9770261 [Travis Galoppo] Corrected a variety of style and naming issues.
8aaa17d [Travis Galoppo] Added additional train() method to companion object for cluster count and tolerance parameters.
676e523 [Travis Galoppo] Fixed to no longer ignore delta value provided on command line
e6ea805 [Travis Galoppo] Merged with master branch; update test suite with latest context changes. Improved cluster initialization strategy.
86fb382 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
719d8cc [Travis Galoppo] Added scala test suite with basic test
c1a8e16 [Travis Galoppo] Made GaussianMixtureModel class serializable Modified sum function for better performance
5c96c57 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
c15405c [Travis Galoppo] SPARK-4156
2014-12-29 15:29:15 -08:00
Burak Yavuz 02b55de3dc [SPARK-4409][MLlib] Additional Linear Algebra Utils
Addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486).
The proposed methods for addition are:

For `Matrix`
 - map: maps the values in the matrix with a given function. Produces a new matrix.
 - update: the values in the matrix are updated with a given function. Occurs in place.

Factory methods for `DenseMatrix`:
 - *zeros: Generate a matrix consisting of zeros
 - *ones: Generate a matrix consisting of ones
 - *eye: Generate an identity matrix
 - *rand: Generate a matrix consisting of i.i.d. uniform random numbers
 - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
 - *diag: Generate a diagonal matrix from a supplied vector
*These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`.

Factory methods for `SparseMatrix`:
 - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar.
 - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers.
 - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers.
 - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training.

Factory methods for `Matrices`:
 - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
 - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix.
 - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.

The names for these methods were selected from MATLAB

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3319 from brkyvz/SPARK-4409 and squashes the following commits:

b0354f6 [Burak Yavuz] [SPARK-4409] Incorporated mengxr's code
04c4829 [Burak Yavuz] Merge pull request #1 from mengxr/SPARK-4409
80cfa29 [Xiangrui Meng] minor changes
ecc937a [Xiangrui Meng] update sprand
4e95e24 [Xiangrui Meng] simplify fromCOO implementation
10a63a6 [Burak Yavuz] [SPARK-4409] Fourth pass of code review
f62d6c7 [Burak Yavuz] [SPARK-4409] Modified genRandMatrix
3971c93 [Burak Yavuz] [SPARK-4409] Third pass of code review
75239f8 [Burak Yavuz] [SPARK-4409] Second pass of code review
e4bd0c0 [Burak Yavuz] [SPARK-4409] Modified horzcat and vertcat
65c562e [Burak Yavuz] [SPARK-4409] Hopefully fixed Java Test
d8be7bc [Burak Yavuz] [SPARK-4409] Organized imports
065b531 [Burak Yavuz] [SPARK-4409] First pass after code review
a8120d2 [Burak Yavuz] [SPARK-4409] Finished updates to API according to SPARK-4614
f798c82 [Burak Yavuz] [SPARK-4409] Updated API according to SPARK-4614
c75f3cd [Burak Yavuz] [SPARK-4409] Added JavaAPI Tests, and fixed a couple of bugs
d662f9d [Burak Yavuz] [SPARK-4409] Modified according to remote repo
83dfe37 [Burak Yavuz] [SPARK-4409] Scalastyle error fixed
a14c0da [Burak Yavuz] [SPARK-4409] Initial commit to add methods
2014-12-29 13:24:26 -08:00
zsxwing f9ed2b6641 [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience
There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR did similar reorganization like [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).

Compiled the following codes with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine.
```Scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object StreamingApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("/some/path")
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Author: zsxwing <zsxwing@gmail.com>

Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:

aa6d44a [zsxwing] Fix a copy-paste error
f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
e6f9cc9 [zsxwing] Update the docs
27833bb [zsxwing] Remove `import StreamingContext._`
c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
2014-12-25 19:46:05 -08:00
Sean Owen 29fabb1b52 SPARK-4297 [BUILD] Build warning fixes omnibus
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.

Author: Sean Owen <sowen@cloudera.com>

Closes #3157 from srowen/SPARK-4297 and squashes the following commits:

8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
2014-12-24 13:32:51 -08:00
DB Tsai a96b72781a [SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R
In most of the academic paper and algorithm implementations,
people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2
for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf

Since MLlib uses different convention, this will result different residuals and
all the stats properties will be different from GLMNET package in R.

The model coefficients will be still the same under this change.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3746 from dbtsai/lir and squashes the following commits:

19c2e85 [DB Tsai] make stepsize twice to converge to the same solution
0b2c29c [DB Tsai] first commit
2014-12-22 16:42:55 -08:00
RJ Nowling ee1fb97a97 [SPARK-4728][MLLib] Add exponential, gamma, and log normal sampling to MLlib da...
...ta generators

This patch adds:

* Exponential, gamma, and log normal generators that wrap Apache Commons math3 to the private API
* Functions for generating exponential, gamma, and log normal RDDs and vector RDDs
* Tests for the above

Author: RJ Nowling <rnowling@gmail.com>

Closes #3680 from rnowling/spark4728 and squashes the following commits:

455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite
3e1134a [RJ Nowling] Fix val/var, unncessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits
58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev
84fd98d [RJ Nowling] Add more values for testing distributions.
9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators
2014-12-18 21:00:49 -08:00
DB Tsai 59a49db598 [SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite
The original test doesn't make sense since if you step in, the lossSum is already NaN,
and the coefficients are diverging. That's because the step size is too large for SGD,
so it doesn't work.

The correct behavior is that you should get smaller coefficients than the one
without regularization. Comparing the values using 20000.0 relative error doesn't
make sense as well.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3735 from dbtsai/mlortestfix and squashes the following commits:

b1a3c42 [DB Tsai] first commit
2014-12-18 13:55:49 -08:00
Yuu ISHIKAWA 8098fab06c [SPARK-4494][mllib] IDFModel.transform() add support for single vector
I improved `IDFModel.transform` to allow using a single vector.

[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #3603 from yu-iskw/idf and squashes the following commits:

256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
2014-12-15 13:44:15 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
DB Tsai d00542987e [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple times in loop
Have a local reference to `values` and `indices` array in the `Vector` object
so JVM can locate the value with one operation call. See `SPARK-4581`
for similar optimization, and the bytecode analysis.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3577 from dbtsai/blasopt and squashes the following commits:

62d38c4 [DB Tsai] formating
0316cef [DB Tsai] first commit
2014-12-03 22:31:39 +08:00
DB Tsai 7fc49ed911 [SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample
Note that the usage of `breezeSquaredDistance` in
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`
is in the critical path, and `breezeSquaredDistance` is slow.
We should replace it with our own implementation.

Here is the benchmark against mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3565 from dbtsai/kmean and squashes the following commits:

08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit
2014-12-03 19:01:56 +08:00
DB Tsai 64f3175bf9 [SPARK-4611][MLlib] Implement the efficient vector norm
The vector norm in breeze is implemented by `activeIterator` which is known to be very slow.
In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and
`k-means` have big performance improvement.

Here is the benchmark against mnist8m dataset.

a) `Normalizer`
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) `k-means`
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3462 from dbtsai/norm and squashes the following commits:

63c7165 [DB Tsai] typo
0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
6fa616c [DB Tsai] address feedback
9b7cb56 [DB Tsai] move norm to static method
0b632e6 [DB Tsai] kmeans
dbed124 [DB Tsai] style
c1a877c [DB Tsai] first commit
2014-12-02 11:40:43 +08:00
Xiangrui Meng 561d31d2f1 [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices
Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to change is `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation. Otherwise, it is very likely to produce inconsistent RDDs. I also added some unit tests for matrix factory methods. All APIs are new in 1.2, so there is no incompatible changes.

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:

3b0e4e2 [Xiangrui Meng] add mima excludes
6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
2014-11-26 08:22:50 -08:00
Xiangrui Meng b5fb1410c5 [SPARK-4604][MLLIB] make MatrixFactorizationModel public
User could construct an MF model directly. I added a note about the performance.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3459 from mengxr/SPARK-4604 and squashes the following commits:

f64bcd3 [Xiangrui Meng] organize imports
ed08214 [Xiangrui Meng] check preconditions and unit tests
a624c12 [Xiangrui Meng] make MatrixFactorizationModel public
2014-11-25 20:11:40 -08:00
Joseph K. Bradley c251fd7405 [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
* the gradient (and therefore loss) does not match that used by Friedman (1999)
* the error computation uses 0/1 accuracy, not log loss

This PR updates LogLoss.
It also adds some doc for boosting and forests.

I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:

cfec17e [Joseph K. Bradley] removed forgotten temp comments
a27eb6d [Joseph K. Bradley] corrections to last log loss commit
ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability
5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once)
e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting
2014-11-25 20:10:15 -08:00
DB Tsai bf1a6aaac5 [SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance
The following optimizations are done to improve the StandardScaler model
transformation performance.

1) Covert Breeze dense vector to primitive vector to reduce the overhead.
2) Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector.
3) Have a local reference to `shift` and `factor` array so JVM can locate the value with one operation call.
4) In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to
make the codebase cleaner.

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copy of `factor` and `shift` arrays,
the runtime is almost three time slower.

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,
```scala
while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}
```
will generate the bytecode
```
   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13
```
, while with the local reference of the `shift` and `factor` arrays, the bytecode will be
```
   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15
```

You can see that with local reference, the both of the arrays will be in the stack, so JVM can access the value without calling `INVOKESPECIAL`.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3435 from dbtsai/standardscaler and squashes the following commits:

85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback
cdb5cef [DB Tsai] small change
9c51eef [DB Tsai] style
fc795e4 [DB Tsai] update
5bffd3d [DB Tsai] first commit
2014-11-25 11:07:11 -08:00
GuoQiang Li f515f9432b [SPARK-4526][MLLIB]GradientDescent get a wrong gradient value according to the gradient formula.
This is caused by the miniBatchSize parameter.The number of `RDD.sample` returns is not fixed.
cc mengxr

Author: GuoQiang Li <witgo@qq.com>

Closes #3399 from witgo/GradientDescent and squashes the following commits:

13cb228 [GuoQiang Li] review commit
668ab66 [GuoQiang Li] Double to Long
b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0
0b5c3e3 [GuoQiang Li] Minor fix
12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
2014-11-25 02:01:19 -08:00
DB Tsai 89f9122646 [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner
In this refactoring, the performance will be slightly increased due to removing
the overhead from breeze vector. The bottleneck is still in breeze norm
which is implemented by activeIterator.

This inefficiency of breeze norm will be addressed in next PR. At least,
this PR makes the code more consistent in the codebase.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3446 from dbtsai/normalizer and squashes the following commits:

e20a2b9 [DB Tsai] first commit
2014-11-25 01:57:34 -08:00
tkaessmann 9ce2bf3821 [SPARK-4582][MLLIB] get raw vectors for further processing in Word2Vec
This is #3309 for the master branch.

e.g. clustering

Author: tkaessmann <tobias.kaessmanns24.com>

Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits:

e3a3142 [tkaessmann] changes the comment for getVectors
58d3d83 [tkaessmann] removes sign from comment
a5be213 [tkaessmann] fixes getVectors to fit code guidelines
3782fa9 [tkaessmann] get raw vectors for further processing

Author: tkaessmann <tobias.kaessmann@s24.com>

Closes #3437 from mengxr/SPARK-4582 and squashes the following commits:

6c666b4 [tkaessmann] get raw vectors for further processing in Word2Vec
2014-11-24 19:58:01 -08:00
Davies Liu b660de7a9c [SPARK-4562] [MLlib] speedup vector
This PR change the underline array of DenseVector to numpy.ndarray to avoid the conversion, because most of the users will using numpy.array.

It also improve the serialization of DenseVector.

Before this change:

trial	| trainingTime | 	testTime
-------|--------|--------
0	| 5.126 | 	1.786
1	|2.698	|1.693

After the change:

trial	| trainingTime |	testTime
-------|--------|--------
0	|4.692	|0.554
1	|2.307	|0.525

This could partially fix the performance regression during test.

Author: Davies Liu <davies@databricks.com>

Closes #3420 from davies/ser2 and squashes the following commits:

0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector
2014-11-24 16:37:14 -08:00
DB Tsai b5d17ef10e [SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector
Previously, we were using Breeze's activeIterator to access the non-zero elements
in dense/sparse vector. Due to the overhead, we switched back to native `while loop`
in #SPARK-4129.

However, #SPARK-4129 requires de-reference the dv.values/sv.values in
each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer,
we're using Breeze's dense vector to store the partial stats, and this is very expensive compared
with using primitive scala array.

In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse
vector operation which makes codebase easier to maintain. Breeze dense vector is replaced
by primitive array to reduce the overhead further.

Benchmarking with mnist8m dataset on single JVM
with first 200 samples loaded in memory, and repeating 5000 times.

Before change:
Sparse Vector - 30.02
Dense Vector - 38.27

With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3288 from dbtsai/activeIterator and squashes the following commits:

844b0e6 [DB Tsai] formating
03dd693 [DB Tsai] futher performance tunning.
1907ae1 [DB Tsai] address feedback
98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation.
c0cbd5a [DB Tsai] fix a bug
6441f92 [DB Tsai] Finished SPARK-4431
2014-11-21 18:15:07 -08:00
Davies Liu ce95bd8e13 [SPARK-4531] [MLlib] cache serialized java object
The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.

This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.

Author: Davies Liu <davies@databricks.com>

Closes #3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
2014-11-21 15:02:31 -08:00
Davies Liu 1c53a5db99 [SPARK-4439] [MLlib] add python api for random forest
```
    class RandomForestModel
     |  A model trained by RandomForest
     |
     |  numTrees(self)
     |      Get number of trees in forest.
     |
     |  predict(self, x)
     |      Predict values for a single data point or an RDD of points using the model trained.
     |
     |  toDebugString(self)
     |      Full model
     |
     |  totalNumNodes(self)
     |      Get total number of nodes, summed over all trees in the forest.
     |

    class RandomForest
     |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a decision tree model for binary or multiclass classification.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels should take values {0, 1, ..., numClasses-1}.
     |      :param numClassesForClassification: number of classes for classification.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                  E.g., an entry (n -> k) indicates that feature n is categorical
     |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                If "auto" is set, this parameter is set based on numTrees:
     |                                  if numTrees == 1, set to "all";
     |                                  if numTrees > 1 (forest) set to "sqrt".
     |      :param impurity: Criterion used for information gain calculation.
     |                   Supported values: "gini" (recommended) or "entropy".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: maximum number of bins used for splitting features (default: 100)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
     |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a decision tree model for regression.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels are real numbers.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                   E.g., an entry (n -> k) indicates that feature n is categorical
     |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                 If "auto" is set, this parameter is set based on numTrees:
     |                                 if numTrees == 1, set to "all";
     |                                 if numTrees > 1 (forest) set to "onethird".
     |      :param impurity: Criterion used for information gain calculation.
     |                       Supported values: "variance".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes.(default: 4)
     |      :param maxBins: maximum number of bins used for splitting features (default: 100)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
```

Author: Davies Liu <davies@databricks.com>

Closes #3320 from davies/forest and squashes the following commits:

8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
2014-11-20 15:31:28 -08:00
Xiangrui Meng 15cacc8124 [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates

manishamde jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:

7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
2014-11-20 00:48:59 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Davies Liu d2e29516f2 [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS
```
class LogisticRegressionWithLBFGS
 |  train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
 |      Train a logistic regression model on the given data.
 |
 |      :param data:           The training data, an RDD of LabeledPoint.
 |      :param iterations:     The number of iterations (default: 100).
 |      :param initialWeights: The initial weights (default: None).
 |      :param regParam:       The regularizer parameter (default: 0.01).
 |      :param regType:        The type of regularizer used for training
 |                             our model.
 |                             :Allowed values:
 |                               - "l1" for using L1 regularization
 |                               - "l2" for using L2 regularization
 |                               - None for no regularization
 |                               (default: "l2")
 |      :param intercept:      Boolean parameter which indicates the use
 |                             or not of the augmented representation for
 |                             training data (i.e. whether bias features
 |                             are activated or not).
 |      :param corrections:    The number of corrections used in the LBFGS update (default: 10).
 |      :param tolerance:      The convergence tolerance of iterations for L-BFGS (default: 1e-4).
 |
 |      >>> data = [
 |      ...     LabeledPoint(0.0, [0.0, 1.0]),
 |      ...     LabeledPoint(1.0, [1.0, 0.0]),
 |      ... ]
 |      >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
 |      >>> lrm.predict([1.0, 0.0])
 |      1
 |      >>> lrm.predict([0.0, 1.0])
 |      0
 |      >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
 |      [1, 0]
```

Author: Davies Liu <davies@databricks.com>

Closes #3307 from davies/lbfgs and squashes the following commits:

34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
2014-11-18 15:57:33 -08:00
Davies Liu 8fbf72b790 [SPARK-4435] [MLlib] [PySpark] improve classification
This PR add setThrehold() and clearThreshold() for LogisticRegressionModel and SVMModel, also support RDD of vector in LogisticRegressionModel.predict(), SVNModel.predict() and NaiveBayes.predict()

Author: Davies Liu <davies@databricks.com>

Closes #3305 from davies/setThreshold and squashes the following commits:

d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
2014-11-18 10:11:13 -08:00
Felix Maximilian Möller cedc3b5aa4 ALS implicit: added missing parameter alpha in doc string
Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de>

Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits:

43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.
2014-11-18 10:08:24 -08:00
GuoQiang Li 5168c6ca9f [SPARK-4422][MLLIB]In some cases, Vectors.fromBreeze get wrong results.
cc mengxr

Author: GuoQiang Li <witgo@qq.com>

Closes #3281 from witgo/SPARK-4422 and squashes the following commits:

5f1fa5e [GuoQiang Li] import order
50783bd [GuoQiang Li] review commits
7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results.
2014-11-16 21:31:51 -08:00
Xiangrui Meng 32218307ed [SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python
The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy.

In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.

davies atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #3232 from mengxr/SPARK-4372 and squashes the following commits:

9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
d3ba096 [Xiangrui Meng] change 'none' back to None
1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
2014-11-13 13:54:16 -08:00
Xiangrui Meng ca26a212fd [SPARK-4378][MLLIB] make ALS more Java-friendly
Add Java-friendly version of `run` and `predict`, and use bulk prediction in Java unit tests. The user guide update will come later (though we may not save many lines of code there). srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #3240 from mengxr/SPARK-4378 and squashes the following commits:

6581503 [Xiangrui Meng] check number of predictions
6c8bbd1 [Xiangrui Meng] make ALS more Java-friendly
2014-11-13 11:42:27 -08:00
Andrew Bullen 484fecbf14 [SPARK-4256] Make Binary Evaluation Metrics functions defined in cases where there ar...
...e 0 positive or 0 negative examples.

Author: Andrew Bullen <andrew.bullen@workday.com>

Closes #3118 from abull/master and squashes the following commits:

c2bf2b1 [Andrew Bullen] [SPARK-4256] Update Code formatting for BinaryClassificationMetricsSpec
36b0533 [Andrew Bullen] [SYMAN-4256] Extract BinaryClassificationMetricsSuite assertions into private method
4d2f79a [Andrew Bullen] [SPARK-4256] Refactor classification metrics tests - extract comparison functions in test
f411e70 [Andrew Bullen] [SPARK-4256] Define precision as 1.0 when there are no positive examples; update code formatting per pull request comments
d9a09ef [Andrew Bullen] Make Binary Evaluation Metrics functions defined in cases where there are 0 positive or 0 negative examples.
2014-11-12 22:14:44 -08:00
Xiangrui Meng 23f5bdf06a [SPARK-4373][MLLIB] fix MLlib maven tests
We want to make sure there is at most one spark context inside the same jvm. JoshRosen

Author: Xiangrui Meng <meng@databricks.com>

Closes #3235 from mengxr/SPARK-4373 and squashes the following commits:

6574b69 [Xiangrui Meng] rename LocalSparkContext to MLlibTestSparkContext
913d48d [Xiangrui Meng] make sure there is at most one spark context inside the same jvm
2014-11-12 18:15:14 -08:00
Davies Liu bd86118c4e [SPARK-4369] [MLLib] fix TreeModel.predict() with RDD
Fix  TreeModel.predict() with RDD, added tests for it.

(Also checked that other models don't have this issue)

Author: Davies Liu <davies@databricks.com>

Closes #3230 from davies/predict and squashes the following commits:

81172aa [Davies Liu] fix predict
2014-11-12 13:56:41 -08:00
Xiangrui Meng 4b736dbab3 [SPARK-3530][MLLIB] pipeline and parameters with examples
This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from  marmbrus and liancheng on the Spark SQL side. The design doc can be found at:

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

**org.apache.spark.ml**

This is a new package with new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, takes leverage on Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms a dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters could be supplied at fit/transform or embedded with components.

1. Param: a strong-typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters

For any component that implements `Params`, user can easily check the doc by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or user can check individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**

1. run a simple logistic regression job

~~~
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(1.0)
    val model = lr.fit(dataset)
    model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
      .select('label, 'score, 'prediction).collect()
      .foreach(println)
~~~

2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric

~~~
    val lr = new LogisticRegression
    val lrParamMaps = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 100.0))
      .addGrid(lr.maxIter, Array(0, 5))
      .build()
    val eval = new BinaryClassificationEvaluator
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(lrParamMaps)
      .setEvaluator(eval)
      .setNumFolds(3)
    val bestModel = cv.fit(dataset)
~~~

3. run a pipeline that consists of a standard scaler and a logistic regression component

~~~
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val lr = new LogisticRegression()
      .setFeaturesCol(scaler.getOutputCol)
    val pipeline = new Pipeline()
      .setStages(Array(scaler, lr))
    val model = pipeline.fit(dataset)
    val predictions = model.transform(dataset)
      .select('label, 'score, 'prediction)
      .collect()
      .foreach(println)
~~~

4. a simple text classification pipeline, which recognizes "spark":

~~~
    val training = sparkContext.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)
    val test = sparkContext.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))
    model.transform(test)
      .select('id, 'text, 'prediction, 'score)
      .collect()
      .foreach(println)
~~~

Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.

**What are missing now and will be added soon:**

1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
2. ~~Java examples.~~
3. ~~Store training parameters in trained models.~~
4. (later) Serialization and Python API.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:

2cc93fd [Xiangrui Meng] hide APIs as much as I can
34319ba [Xiangrui Meng] use local instead local[2] for unit tests
2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
c9daab4 [Xiangrui Meng] remove mockito version
1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
6ffc389 [Xiangrui Meng] try to fix unit test
a59d8b7 [Xiangrui Meng] doc updates
977fd9d [Xiangrui Meng] add scala ml package object
6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
731f0e4 [Xiangrui Meng] update package doc
0435076 [Xiangrui Meng] remove ;this from setters
fa21d9b [Xiangrui Meng] update extends indentation
f1091b3 [Xiangrui Meng] typo
228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
f51cd27 [Xiangrui Meng] rename default to defaultValue
b3be094 [Xiangrui Meng] refactor schema transform in lr
8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
51f1c06 [Xiangrui Meng] remove leftover code in Transformer
494b632 [Xiangrui Meng] compure score once
ad678e9 [Xiangrui Meng] more doc for Transformer
4306ed4 [Xiangrui Meng] org imports in text pipeline
6e7c1c7 [Xiangrui Meng] update pipeline
4f9e34f [Xiangrui Meng] more doc for pipeline
aa5dbd4 [Xiangrui Meng] fix typo
11be383 [Xiangrui Meng] fix unit tests
3df7952 [Xiangrui Meng] clean up
986593e [Xiangrui Meng] re-org java test suites
2b11211 [Xiangrui Meng] remove external data deps
9fd4933 [Xiangrui Meng] add unit test for pipeline
2a0df46 [Xiangrui Meng] update tests
2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
27582a4 [Xiangrui Meng] doc changes
73a000b [Xiangrui Meng] add schema transformation layer
6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
62ca2bb [Xiangrui Meng] check param parent in set/get
1622349 [Xiangrui Meng] add getModel to PipelineModel
a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
e246f29 [Xiangrui Meng] re-org:
7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
b95c408 [Xiangrui Meng] remove implicits add unit tests to params
bab3e5b [Xiangrui Meng] update params
fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
6e86d98 [Xiangrui Meng] some code clean-up
2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
3f810cd [Xiangrui Meng] use multi-model training api in cv
5b8f413 [Xiangrui Meng] rename model to modelParams
9d2d35d [Xiangrui Meng] test varargs and chain model params
f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
1ef26e0 [Xiangrui Meng] specialize methods/types for Java
df293ed [Xiangrui Meng] switch to setter/getter
376db0a [Xiangrui Meng] pipeline and parameters
2014-11-12 10:38:57 -08:00
Xiangrui Meng 84324fbcb9 [SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero
See inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3220 from mengxr/SPARK-4355 and squashes the following commits:

5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up
2014-11-12 01:50:11 -08:00
Manish Amde 2ef016b130 [MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time.
Before:
[info] GradientBoostingSuite:
[info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds)
[info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds)
[info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds)

After:
[info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds)
[info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds)
[info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds)

cc: mengxr, jkbradley

Author: Manish Amde <manish9ue@gmail.com>

Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits:

8994552 [Manish Amde] reducing gbt test run times
2014-11-11 22:47:53 -08:00
Michelangelo D'Agostino 7e9d975676 [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API
SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API.  This pull request resolves that.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #3095 from mdagost/python_nmf and squashes the following commits:

a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI.  Remove the new static methods I added.  Set seed in tests.  Change ratings to ratingsRDD in both train and trainImplicit for consistency.
7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes.  Also made the python ALS tests more exact.
a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
2014-11-07 22:53:01 -08:00
Joseph K. Bradley 5b3b6f6f5f [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java
### Summary

* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* small cleanups and fixes

### Details

GradientBoosting bug fixes (“bug” = bad default options)
* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since it causes an error if persisted twice)

BoostingStrategy
* numEstimators: renamed to numIterations
* removed subsamplingRate (duplicated by Strategy)
* removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy

Strategy (for DecisionTree)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added setCategoricalFeaturesInfo method taking Java Map.
* Cleaned up assertValid
* Changed val’s to def’s since parameters can now be changed.

CC: manishamde mengxr codedeft

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes
2014-11-05 10:33:13 -08:00
Davies Liu c8abddc516 [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or againt the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-04 21:35:52 -08:00
Niklas Wilcke f90ad5d426 [Spark-4060] [MLlib] exposing special rdd functions to the public
Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #2907 from numbnut/master and squashes the following commits:

7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907
2014-11-04 09:57:03 -08:00