Commit graph

563 commits

Author SHA1 Message Date
Xiangrui Meng 4f4c6d5a5d [SPARK-5730][ML] add doc groups to spark.ml components
This PR adds three groups to the ScalaDoc: `param`, `setParam`, and `getParam`. Params will show up in the generated Scala API doc as the top group. Setters/getters will be at the bottom.

Preview:

![screen shot 2015-02-13 at 2 47 49 pm](https://cloud.githubusercontent.com/assets/829644/6196657/5740c240-b38f-11e4-94bb-bd8ef5a796c5.png)

Author: Xiangrui Meng <meng@databricks.com>

Closes #4600 from mengxr/SPARK-5730 and squashes the following commits:

febed9a [Xiangrui Meng] add doc groups to spark.ml components
2015-02-13 16:45:59 -08:00
Xiangrui Meng d50a91d529 [SPARK-5803][MLLIB] use ArrayBuilder to build primitive arrays
because ArrayBuffer is not specialized.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4594 from mengxr/SPARK-5803 and squashes the following commits:

1261bd5 [Xiangrui Meng] merge master
a4ea872 [Xiangrui Meng] use ArrayBuilder to build primitive arrays
2015-02-13 16:43:49 -08:00
Xiangrui Meng 99bd500665 [SPARK-5757][MLLIB] replace SQL JSON usage in model import/export by json4s
This PR detaches MLlib model import/export code from SQL's JSON support, and hence unblocks #4544 . yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #4555 from mengxr/SPARK-5757 and squashes the following commits:

b0415e8 [Xiangrui Meng] replace SQL JSON usage by json4s
2015-02-12 10:48:13 -08:00
Liang-Chi Hsieh f86a89a2e0 [SPARK-5714][Mllib] Refactor initial step of LDA to remove redundant operations
The `initialState` of LDA performs several RDD operations that looks redundant. This pr tries to simplify these operations.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4501 from viirya/sim_lda and squashes the following commits:

4870fe4 [Liang-Chi Hsieh] For comments.
9af1487 [Liang-Chi Hsieh] Refactor initial step of LDA to remove redundant operations.
2015-02-10 21:51:15 -08:00
Reynold Xin 7e24249af1 [SQL][DataFrame] Fix column computability bug.
Do not recursively strip out projects. Only strip the first level project.

```scala
df("colA") + df("colB").as("colC")
```

Previously, the above would construct an invalid plan.

Author: Reynold Xin <rxin@databricks.com>

Closes #4519 from rxin/computability and squashes the following commits:

87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
2015-02-10 19:50:44 -08:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
MechCoder fd2c032f95 [SPARK-5021] [MLlib] Gaussian Mixture now supports Sparse Input
Following discussion in the Jira.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4459 from MechCoder/sparse_gmm and squashes the following commits:

1b18dab [MechCoder] Rewrite syr for sparse matrices
e579041 [MechCoder] Add test for covariance matrix
5cb370b [MechCoder] Separate tests for sparse data
5e096bd [MechCoder] Alphabetize and correct error message
e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input
2015-02-10 14:05:55 -08:00
Joseph K. Bradley ef2f55b97f [SPARK-5597][MLLIB] save/load for decision trees and emsembles
This is based on #4444 from jkbradley with the following changes:

1. Node schema updated to
   ~~~
treeId: int
nodeId: Int
predict/
       |- predict: Double
       |- prob: Double
impurity: Double
isLeaf: Boolean
split/
     |- feature: Int
     |- threshold: Double
     |- featureType: Int
     |- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
~~~

2. Some refactor of the implementation.

Closes #4444.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4493 from mengxr/SPARK-5597 and squashes the following commits:

75e3bb6 [Xiangrui Meng] fix style
2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
45873a2 [Joseph K. Bradley] org imports
1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles
2015-02-09 22:09:07 -08:00
Sean Owen 36c4e1d759 SPARK-4900 [MLLIB] MLlib SingularValueDecomposition ARPACK IllegalStateException
Fix ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, not sure if that's to be treated as an MLlib or Breeze issue.

Author: Sean Owen <sowen@cloudera.com>

Closes #4485 from srowen/SPARK-4900 and squashes the following commits:

7355aa1 [Sean Owen] Fix ARPACK error code mapping
2015-02-09 21:13:58 -08:00
Sandy Ryza 0793ee1b4d SPARK-2149. [MLLIB] Univariate kernel density estimation
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1093 from sryza/sandy-spark-2149 and squashes the following commits:

5f06b33 [Sandy Ryza] More review comments
0f73060 [Sandy Ryza] Respond to Sean's review comments
0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation
2015-02-09 10:12:12 +00:00
Sean Owen 4396dfb37f SPARK-4405 [MLLIB] Matrices.* construction methods should check for rows x cols overflow
Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.

Author: Sean Owen <sowen@cloudera.com>

Closes #4461 from srowen/SPARK-4405 and squashes the following commits:

c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods
2015-02-08 21:08:50 -08:00
Joseph K. Bradley c17161189d [SPARK-5660][MLLIB] Make Matrix apply public
This is #4447 with `override`.

Closes #4447

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4462 from mengxr/SPARK-5660 and squashes the following commits:

f82c8d6 [Xiangrui Meng] add override to matrix.apply
91cedde [Joseph K. Bradley] made matrix apply public
2015-02-08 21:07:36 -08:00
Xiangrui Meng 5c299c58fb [SPARK-5598][MLLIB] model save/load for ALS
following #4233. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:

a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM
2015-02-08 16:26:20 -08:00
mbittmann 4878313695 [SPARK-5656] Fail gracefully for large values of k and/or n that will ex...
...ceed max int.

Large values of k and/or n in EigenValueDecomposition.symmetricEigs will result in array initialization to a value larger than Integer.MAX_VALUE in the following: var v = new Array[Double](n * ncv)

Author: mbittmann <mbittmann@gmail.com>
Author: bittmannm <mark.bittmann@agilex.com>

Closes #4433 from mbittmann/master and squashes the following commits:

ee56e05 [mbittmann] [SPARK-5656] Combine checks into simple message
e49cbbb [mbittmann] [SPARK-5656] Simply error message
860836b [mbittmann] Array size check updates based on code review
a604816 [bittmannm] [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int.
2015-02-08 10:13:29 +00:00
Xiangrui Meng 0e23ca9f80 [SPARK-5601][MLLIB] make streaming linear algorithms Java-friendly
Overload `trainOn`, `predictOn`, and `predictOnValues`.

CC freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #4432 from mengxr/streaming-java and squashes the following commits:

6a79b85 [Xiangrui Meng] add java test for streaming logistic regression
2d7b357 [Xiangrui Meng] organize imports
1f662b3 [Xiangrui Meng] make streaming linear algorithms Java-friendly
2015-02-06 15:42:59 -08:00
Liang-Chi Hsieh 80f3bcb58f [SPARK-5652][Mllib] Use broadcasted weights in LogisticRegressionModel
`LogisticRegressionModel`'s `predictPoint` should directly use broadcasted weights. This pr also fixes the compilation errors of two unit test suite: `JavaLogisticRegressionSuite ` and `JavaLinearRegressionSuite`.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4429 from viirya/use_bcvalue and squashes the following commits:

5a797e5 [Liang-Chi Hsieh] Use broadcasted weights. Fix compilation error.
2015-02-06 11:22:11 -08:00
Joseph K. Bradley dc0c4490a1 [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
  * The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.

### Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
 * There is no strongly typed training-time API.
 * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
 * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
 * This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
 * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
 * Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
 * Update validateAndTransformSchema
 * Update transform
* (Updated examples, test suites according to other changes)

New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
 * Added Java version too

Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: mengxr  etrain  shivaram

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:

405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
2015-02-05 23:43:47 -08:00
Xiangrui Meng 6b88825a25 [SPARK-5604][MLLIB] remove checkpointDir from trees
This is the second part of SPARK-5604, which removes checkpointDir from tree strategies. Note that this is a break change. I will mention it in the migration guide.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4407 from mengxr/SPARK-5604-1 and squashes the following commits:

13a276d [Xiangrui Meng] remove checkpointDir from trees
2015-02-05 23:32:09 -08:00
Xiangrui Meng c19152cd2a [SPARK-5604[MLLIB] remove checkpointDir from LDA
`checkpointDir` is a Spark global configuration. Users should set it outside LDA. This PR also hides some methods under `private[clustering] object LDA`, so they don't show up in the generated Java doc (SPARK-5610).

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4390 from mengxr/SPARK-5604 and squashes the following commits:

a34bb39 [Xiangrui Meng] remove checkpointDir from LDA
2015-02-05 15:07:33 -08:00
x1- 62371adaa5 [SPARK-5460][MLlib] Wrapped Try around deleteAllCheckpoints - RandomForest.
Because `deleteAllCheckpoints` has IOException potential.
fix issue.

Author: x1- <viva008@gmail.com>

Closes #4347 from x1-/SPARK-5460 and squashes the following commits:

7a3d8de [x1-] change `Try()` to `try catch { case ... }` ar RandomForest.
3a52745 [x1-] modified typo. 'faild' -> 'failed' and remove disused '-'.
1572576 [x1-] Wrapped `Try` around `deleteAllCheckpoints` - RandomForest.
2015-02-05 15:02:04 -08:00
Reynold Xin 6580929fa0 [HOTFIX] MLlib build break. 2015-02-05 00:42:50 -08:00
Reynold Xin c3ba4d4cd0 [MLlib] Minor: UDF style update.
Author: Reynold Xin <rxin@databricks.com>

Closes #4388 from rxin/mllib-style and squashes the following commits:

61d465b [Reynold Xin] oops
3364295 [Reynold Xin] Missed one ..
5e068e3 [Reynold Xin] [MLlib] Minor: UDF style update.
2015-02-04 23:57:53 -08:00
Reynold Xin 7d789e117d [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
Author: Reynold Xin <rxin@databricks.com>

Closes #4386 from rxin/df-implicits and squashes the following commits:

9d96606 [Reynold Xin] style fix
edd296b [Reynold Xin] ReplSuite
1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
2015-02-04 23:44:34 -08:00
Xiangrui Meng db34690466 [SPARK-5599] Check MLlib public APIs for 1.3
There are no break changes (against 1.2) in this PR. I hide the PythonMLLibAPI, which is only called by Py4J, and renamed `SparseMatrix.diag` to `SparseMatrix.spdiag`. All other changes are documentation and annotations. The `Experimental` tag is removed from `ALS.setAlpha` and `Rating`. One issue not addressed in this PR is the `setCheckpointDir` in `LDA` (https://issues.apache.org/jira/browse/SPARK-5604).

CC: srowen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4377 from mengxr/SPARK-5599 and squashes the following commits:

17975dc [Xiangrui Meng] fix tests
4487f20 [Xiangrui Meng] remove experimental tag from each stat method because Statistics is experimental already
3cd969a [Xiangrui Meng] remove freeman (sorry~) from StreamLA public doc
55900f5 [Xiangrui Meng] make IR experimental and update its doc
9b8eed3 [Xiangrui Meng] graduate Rating and setAlpha in ALS
b854d28 [Xiangrui Meng] correct iid doc in RandomRDDs
27f5bdd [Xiangrui Meng] update linalg docs and some new method signatures
371721b [Xiangrui Meng] mark fpg as experimental and update its doc
8aca7ee [Xiangrui Meng] change SLR to experimental and update the doc
ebbb2e9 [Xiangrui Meng] mark PIC experimental and update the doc
7830d3b [Xiangrui Meng] mark GMM experimental
a378496 [Xiangrui Meng] use the correct subscript syntax in PIC
c65c424 [Xiangrui Meng] update LDAModel doc
a213b0c [Xiangrui Meng] update GMM constructor
3993054 [Xiangrui Meng] hide algorithm in SLR
ad6b9ce [Xiangrui Meng] Revert "make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD"
0054684 [Xiangrui Meng] add doc to LRModel's constructor
a89763b [Xiangrui Meng] make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD
7c0946c [Xiangrui Meng] hide PythonMLLibAPI
2015-02-04 23:03:47 -08:00
Joseph K. Bradley 975bcef467 [SPARK-5596] [mllib] ML model import/export for GLMs, NaiveBayes
This is a PR for Parquet-based model import/export.  Please see the design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).

Note: This includes only a subset of regression and classification models:
* NaiveBayes, SVM, LogisticRegression
* LinearRegression, RidgeRegression, Lasso

Follow-up PRs will cover other models.

Sketch of current contents:
* New traits: Saveable, Loader
* Implementations for some algorithms
* Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)

CC: mengxr  selvinsource

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4233 from jkbradley/ml-import-export and squashes the following commits:

87c4eb8 [Joseph K. Bradley] small cleanups
12d9059 [Joseph K. Bradley] Many cleanups after code review.  Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
b4ee064 [Joseph K. Bradley] Reorganized save/load for regression and classification.  Renamed concepts to Saveable, Loader
a34aef5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into ml-import-export
ee99228 [Joseph K. Bradley] scala style fix
79675d5 [Joseph K. Bradley] cleanups in LogisticRegression after rebasing after multinomial PR
d1e5882 [Joseph K. Bradley] organized imports
2935963 [Joseph K. Bradley] Added save/load and tests for most classification and regression models
c495dba [Joseph K. Bradley] made version for model import/export local to each model
1496852 [Joseph K. Bradley] Added save/load for NaiveBayes
8d46386 [Joseph K. Bradley] Added save/load to NaiveBayes
1577d70 [Joseph K. Bradley] fixed issues after rebasing on master (DataFrame patch)
64914a3 [Joseph K. Bradley] added getThreshold to SVMModel
b1fc5ec [Joseph K. Bradley] small cleanups
418ba1b [Joseph K. Bradley] Added save, load to mllib.classification.LogisticRegressionModel, plus test suite
2015-02-04 22:46:48 -08:00
Xiangrui Meng eb15631854 [FIX][MLLIB] fix seed handling in Python GMM
If `seed` is `None` on the python side, it will pass in as a `null`. So we should use `java.lang.Long` instead of `Long` to take it.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4349 from mengxr/gmm-fix and squashes the following commits:

3be5926 [Xiangrui Meng] fix seed handling in Python GMM
2015-02-03 20:39:11 -08:00
Reynold Xin 1077f2e1de [SPARK-5578][SQL][DataFrame] Provide a convenient way for Scala users to use UDFs
A more convenient way to define user-defined functions.

Author: Reynold Xin <rxin@databricks.com>

Closes #4345 from rxin/defineUDF and squashes the following commits:

639c0f8 [Reynold Xin] udf tests.
0a0b339 [Reynold Xin] defineUDF -> udf.
b452b8d [Reynold Xin] Fix UDF registration.
d2e42c3 [Reynold Xin] SQLContext.udf.register() returns a UserDefinedFunction also.
4333605 [Reynold Xin] [SQL][DataFrame] defineUDF.
2015-02-03 20:07:46 -08:00
Jacky Li e380d2d46c [SPARK-5520][MLlib] Make FP-Growth implementation take generic item types (WIP)
Make FPGrowth.run API take generic item types:
`def run[Item: ClassTag, Basket <: Iterable[Item]](data: RDD[Basket]): FPGrowthModel[Item]`
so that user can invoke it by run[String, Seq[String]], run[Int, Seq[Int]], run[Int, List[Int]], etc.

Scala part is done, while java part is still in progress

Author: Jacky Li <jacky.likun@huawei.com>
Author: Jacky Li <jackylk@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4340 from jackylk/SPARK-5520-WIP and squashes the following commits:

f5acf84 [Jacky Li] Merge pull request #2 from mengxr/SPARK-5520
63073d0 [Xiangrui Meng] update to make generic FPGrowth Java-friendly
737d8bb [Jacky Li] fix scalastyle
793f85c [Jacky Li] add Java test case
7783351 [Jacky Li] add generic support in FPGrowth
2015-02-03 17:02:42 -08:00
Xiangrui Meng 659329f9ee [minor] update streaming linear algorithms
Author: Xiangrui Meng <meng@databricks.com>

Closes #4329 from mengxr/streaming-lr and squashes the following commits:

78731e1 [Xiangrui Meng] update streaming linear algorithms
2015-02-03 00:14:43 -08:00
Joseph K. Bradley 980764f3c0 [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM
**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**

The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it.  In particular, see the API and Planning for the Future sections.

* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there.  However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently.  It will be important to check and improve accuracy.  For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.

**Dependency: This makes MLlib depend on GraphX.**

Files and classes:
* LDA.scala (441 lines):
  * class LDA (main estimator class)
  * LDA.Document  (text + document ID)
* LDAModel.scala (266 lines)
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)

Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh.  "On Smoothing and Inference for Topic Models."  UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala

Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)

Here, I list the main changes AFTER the design doc was posted.

Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters.  This is different from the likelihood of the data given the hyperparameters, which would be harder to compute.  I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors.  I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR.  I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms.  See Asuncion et al. (2009) in the design doc for a good demonstration.  I encourage good behavior via defaults and warning messages.

Items planned for future PRs:
* perplexity
* API taking Strings

* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names

* Should LDA reside in clustering?  Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel

* Does the API seem reasonable and extensible?

* Unit tests:
  * Should there be a test which checks a clustering results?  E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters.  Does that sound useful or too flaky?

This has not been tested much for scaling.  I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics.  Running it for 500 iterations made it fail because of GC problems.  I'm running larger scale tests & will put results here, but future PRs may need to improve the scaling.

* dlwh  for the initial implementation
  * + jegonzal  for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal  IlyaKozlov
  * Note: The plan is to include this full list in the authors if this PR gets merged.  Please notify me if you prefer otherwise.

CC: mengxr

Authors:
  Joseph K. Bradley <joseph@databricks.com>
  Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
  David Hall <david.lw.hall@gmail.com>
  Guoqiang Li <witgo@qq.com>
  Xiangrui Meng <meng@databricks.com>
  Pedro Rodriguez <pedro@snowgeek.org>
  Avanesov Valeriy <acopich@gmail.com>
  Xusen Yin <yinxusen@gmail.com>

Closes #2388
Closes #4047 from jkbradley/davidhall-lda and squashes the following commits:

77e8814 [Joseph K. Bradley] small doc fix
5c74345 [Joseph K. Bradley] cleaned up doc based on code review
589728b [Joseph K. Bradley] Updates per code review.  Main change was in LDAExample for faster vocab computation.  Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
e391474 [Joseph K. Bradley] Removed LDATiming.  Added PeriodicGraphCheckpointerSuite.scala.  Small LDA cleanups.
e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception.  Improved preprocessing to reduce passes over data
1a231b4 [Joseph K. Bradley] fixed scalastyle
91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample.  Will remove LDATiming after done testing
993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration   * Also added aliases alpha, beta
cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
43c1c40 [Joseph K. Bradley] small cleanup
0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods.  Changed word to term in some places.  Updated LDAExample to use default smoothing amounts.
fb1e7b5 [Xiangrui Meng] minor
08d59a3 [Xiangrui Meng] reset spacing
9fe0b95 [Xiangrui Meng] optimize aggregateMessages
cec0a9c [Xiangrui Meng] * -> *=
6cb11b0 [Xiangrui Meng] optimize computePTopic
9eb3d02 [Xiangrui Meng] + -> +=
892530c [Xiangrui Meng] use axpy
45cc7f2 [Xiangrui Meng] mapPart -> flatMap
ce53be9 [Joseph K. Bradley] fixed example name
75749e7 [Joseph K. Bradley] scala style fix
9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
377ebd9 [Joseph K. Bradley] separated LDA models into own file.  more cleanups before PR
2d40006 [Joseph K. Bradley] cleanups before PR
2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
2015-02-02 23:57:37 -08:00
Xiangrui Meng 0cc7b88c99 [SPARK-5536] replace old ALS implementation by the new one
The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.

CC: srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:

5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
2015-02-02 23:49:09 -08:00
FlytxtRnD 50a1a874e1 [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model
Python API for the Gaussian Mixture Model clustering algorithm in MLLib.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:

c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
2015-02-02 23:04:55 -08:00
freeman eb0da6c4bd [SPARK-4979][MLLIB] Streaming logisitic regression
This adds support for streaming logistic regression with stochastic gradient descent, in the same manner as the existing implementation of streaming linear regression. It is a relatively simple addition because most of the work is already done by the abstract class `StreamingLinearAlgorithm` and existing algorithms and models from MLlib.

The PR includes
- Streaming Logistic Regression algorithm
- Unit tests for accuracy, streaming convergence, and streaming prediction
- An example use

cc mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #4306 from freeman-lab/streaming-logisitic-regression and squashes the following commits:

5c2c70b [freeman] Use Option on model
5cca2bc [freeman] Merge remote-tracking branch 'upstream/master' into streaming-logisitic-regression
275f8bd [freeman] Make private to mllib
3926e4e [freeman] Line formatting
5ee8694 [freeman] Experimental tag for docs
2fc68ac [freeman] Fix example formatting
85320b1 [freeman] Fixed line length
d88f717 [freeman] Remove stray comment
59d7ecb [freeman] Add streaming logistic regression
e78fe28 [freeman] Add streaming logistic regression example
321cc66 [freeman] Set private and protected within mllib
2015-02-02 22:42:15 -08:00
Liang-Chi Hsieh 1bcd46574e [SPARK-5512][Mllib] Run the PIC algorithm with initial vector suggected by the PIC paper
As suggested by the paper of Power Iteration Clustering, it is useful to set the initial vector v0 as the degree vector d. This pr tries to add a running method for that.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4301 from viirya/pic_degreevector and squashes the following commits:

7db28fb [Liang-Chi Hsieh] Refactor it to address comments.
19cf94e [Liang-Chi Hsieh] Add an option to select initialization method.
ec88567 [Liang-Chi Hsieh] Run the PIC algorithm with degree vector d as suggected by the PIC paper.
2015-02-02 19:34:25 -08:00
Xiangrui Meng ef65cf09b0 [SPARK-5540] hide ALS.solveLeastSquares
This method survived the code review and it has been there since v1.1.0. It exposes jblas types. Let's remove it from the public API. I think no one calls it directly.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4318 from mengxr/SPARK-5540 and squashes the following commits:

586ade6 [Xiangrui Meng] hide ALS.solveLeastSquares
2015-02-02 17:10:01 -08:00
DB Tsai b1aa8fe988 [SPARK-2309][MLlib] Multinomial Logistic Regression
#1379 is automatically closed by asfgit, and github can not reopen it once it's closed, so this will be the new PR.

Binary Logistic Regression can be extended to Multinomial Logistic Regression by running K-1 independent Binary Logistic Regression models. The following formula is implemented.
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3833 from dbtsai/mlor and squashes the following commits:

4e2f354 [DB Tsai] triger jenkins
697b7c9 [DB Tsai] address some feedback
4ce4d33 [DB Tsai] refactoring
ff843b3 [DB Tsai] rebase
f114135 [DB Tsai] refactoring
4348426 [DB Tsai] Addressed feedback from Sean Owen
a252197 [DB Tsai] first commit
2015-02-02 15:59:15 -08:00
Xiangrui Meng 46d50f151c [SPARK-5513][MLLIB] Add nonnegative option to ml's ALS
This PR ports the NNLS solver to the new ALS implementation.

CC: coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #4302 from mengxr/SPARK-5513 and squashes the following commits:

4cbdab0 [Xiangrui Meng] fix serialization
88de634 [Xiangrui Meng] add NNLS to ml's ALS
2015-02-02 15:55:44 -08:00
Alexander Ulanov c081b21b1f [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection
The following is implemented:
1) generic traits for feature selection and filtering
2) trait for feature selection of LabeledPoint with discrete data
3) traits for calculation of contingency table and chi squared
4) class for chi-squared feature selection
5) tests for the above

Needs some optimization in matrix operations.

This request is a try to implement feature selection for MLLIB, the previous work by the issue author izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). This request is also related to data discretization issues: https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216 that weren't merged.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #1484 from avulanov/featureselection and squashes the following commits:

755d358 [Alexander Ulanov] Addressing reviewers comments @mengxr
a6ad82a [Alexander Ulanov] Addressing reviewers comments @mengxr
714b878 [Alexander Ulanov] Addressing reviewers comments @mengxr
010acff [Alexander Ulanov] Rebase
427ca4e [Alexander Ulanov] Addressing reviewers comments: implement VectorTransformer interface, use Statistics.chiSqTest
f9b070a [Alexander Ulanov] Adding Apache header in tests...
80363ca [Alexander Ulanov] Tests, comments, apache headers and scala style
150a3e0 [Alexander Ulanov] Scala style fix
f356365 [Alexander Ulanov] Chi Squared by contingency table. Refactoring
2bacdc7 [Alexander Ulanov] Combinations and chi-squared values test
66e0333 [Alexander Ulanov] Feature selector, fix of lazyness
aab9b73 [Alexander Ulanov] Feature selection redesign with vigdorchik
e24eee4 [Alexander Ulanov] Traits for FeatureSelection, CombinationsCalculator and FeatureFilter
ca49e80 [Alexander Ulanov] Feature selection filter
2ade254 [Alexander Ulanov] Code style
0bd8434 [Alexander Ulanov] Chi Squared feature selection: initial version
2015-02-02 12:13:05 -08:00
Jacky Li 859f7249a6 [SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib
Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark. This PR add an implementation for it.
There is a point I am not sure wether it is most efficient. In order to filter out the eligible frequent item set, currently I am using a cartesian operation on two RDDs to calculate the degree of support of each item set, not sure wether it is better to use broadcast variable to achieve the same.

I will add an example to use this algorithm if requires

Author: Jacky Li <jacky.likun@huawei.com>
Author: Jacky Li <jackylk@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #2847 from jackylk/apriori and squashes the following commits:

bee3093 [Jacky Li] Merge pull request #1 from mengxr/SPARK-4001
7e69725 [Xiangrui Meng] simplify FPTree and update FPGrowth
ec21f7d [Jacky Li] fix scalastyle
93f3280 [Jacky Li] create FPTree class
d110ab2 [Jacky Li] change test case to use MLlibTestSparkContext
a6c5081 [Jacky Li] Add Parallel FPGrowth algorithm
eb3e4ca [Jacky Li] add FPGrowth
03df2b6 [Jacky Li] refactory according to comments
7b77ad7 [Jacky Li] fix scalastyle check
f68a0bd [Jacky Li] add 2 apriori implemenation and fp-growth implementation
889b33f [Jacky Li] modify per scalastyle check
da2cba7 [Jacky Li] adding apriori algorithm for frequent item set mining in Spark
2015-02-01 20:07:25 -08:00
Yuhao Yang d85cd4eb14 [Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
JIRA link: https://issues.apache.org/jira/browse/SPARK-5406

The code in breeze svd  imposes the upper bound for LocalLAPACK in RowMatrix.computeSVD
code from breeze svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala)
     val workSize = ( 3
        * scala.math.min(m, n)
        * scala.math.min(m, n)
        + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
          * scala.math.min(m, n) + 4 * scala.math.min(m, n))
      )
      val work = new Array[Double](workSize)

As a result, 7 * n * n + 4 * n < Int.MaxValue at least (depends on JVM)

In some worse cases, like n = 25000, work size will become positive again (80032704) and bring wired behavior.

The PR is only the beginning, to support Genbase ( an important biological benchmark that would help promote Spark to genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf),
which needs to compute svd for matrix up to 60K * 70K. I found many potential issues and would like to know if there's any plan undergoing that would expand the range of matrix computation based on Spark.
Thanks.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4200 from hhbyyh/rowMatrix and squashes the following commits:

f7864d0 [Yuhao Yang] update auto logic for rowMatrix svd
23860e4 [Yuhao Yang] fix comment style
e48a6e4 [Yuhao Yang] make latent svd computation constraint clear
2015-02-01 19:40:26 -08:00
Xiangrui Meng 4a171225ba [SPARK-5424][MLLIB] make the new ALS impl take generic ID types
This PR makes the ALS implementation take generic ID types, e.g., Long and String, and expose it as a developer API.

TODO:
- [x] make sure that specialization works (validated in profiler)

srowen You may like this change:) I hit a Scala compiler bug with specialization. It compiles now but users and items must have the same type. I'm going to check whether specialization really works.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4281 from mengxr/generic-als and squashes the following commits:

96072c3 [Xiangrui Meng] merge master
135f741 [Xiangrui Meng] minor update
c2db5e5 [Xiangrui Meng] make test pass
86588e1 [Xiangrui Meng] use a single ID type for both users and items
74f1f73 [Xiangrui Meng] compile but runtime error at test
e36469a [Xiangrui Meng] add classtags and make it compile
7a5aeb3 [Xiangrui Meng] UserType -> User, ItemType -> Item
c8ee0bc [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into generic-als
72b5006 [Xiangrui Meng] remove generic from pipeline interface
8bbaea0 [Xiangrui Meng] make ALS take generic IDs
2015-02-01 14:13:31 -08:00
Octavian Geagla bdb0680d37 [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use
This seems complete, the duplication of tests for provided means/variances might be overkill, would appreciate some feedback.

Author: Octavian Geagla <ogeagla@gmail.com>

Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:

fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
2015-02-01 09:21:14 -08:00
Sean Owen c84d5a10e8 SPARK-3359 [CORE] [DOCS] sbt/sbt unidoc doesn't work with Java 8
These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.

Author: Sean Owen <sowen@cloudera.com>

Closes #4193 from srowen/SPARK-3359 and squashes the following commits:

5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible
2015-01-31 10:40:42 -08:00
Burak Yavuz ef8974b1b7 [SPARK-3975] Added support for BlockMatrix addition and multiplication
Support for multiplying and adding large distributed matrices!

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
Author: Burak Yavuz <brkyvz@dn0a221430.sunet>
Author: Burak Yavuz <brkyvz@dn0a22b17d.sunet>

Closes #4274 from brkyvz/SPARK-3975PR2 and squashes the following commits:

17abd59 [Burak Yavuz] added indices to error message
ac25783 [Burak Yavuz] merged masyer
b66fd8b [Burak Yavuz] merged masyer
e39baff [Burak Yavuz] addressed code review v1
2dba642 [Burak Yavuz] [SPARK-3975] Added support for BlockMatrix addition and multiplication
fb7624b [Burak Yavuz] merged master
98c58ea [Burak Yavuz] added tests
cdeb5df [Burak Yavuz] before adding tests
c9bf247 [Burak Yavuz] fixed merge conflicts
1cb0d06 [Burak Yavuz] [SPARK-3976] Added doc
f92a916 [Burak Yavuz] merge upstream
1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
e3d24c3 [Burak Yavuz] [SPARK-3976] Pulled upstream changes
fa3774f [Burak Yavuz] [SPARK-3976] updated matrix multiplication and addition implementation
239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
add7b05 [Burak Yavuz] [SPARK-3976] Updated code according to upstream changes
e29acfd [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3976
3127233 [Burak Yavuz] fixed merge conflicts with upstream
ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
8e954ab [Burak Yavuz] save changes
bbeae8c [Burak Yavuz] merged master
987ea53 [Burak Yavuz] merged master
49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
beb1edd [Burak Yavuz] merge conflicts fixed
f41d8db [Burak Yavuz] update tests
b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
56b0546 [Burak Yavuz] updates from 3974 PR
b7b8a8f [Burak Yavuz] pull updates from master
b2dec63 [Burak Yavuz] Pull changes from 3974
19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
5f062e6 [Burak Yavuz] updates with 3974
6729fbd [Burak Yavuz] Updated with respect to SPARK-3974 PR
589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
63a4858 [Burak Yavuz] added grid multiplication
aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
7381b99 [Burak Yavuz] merge with PR1
f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
b693209 [Burak Yavuz] Ready for Pull request
2015-01-31 00:47:30 -08:00
martinzapletal 34250a613c [MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm
This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.

The Isotonic regression problem is sufficiently described in [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false), [Wikipedia](http://en.wikipedia.org/wiki/Isotonic_regression) or [Stat Wiki](http://stat.wikia.com/wiki/Isotonic_regression).

Pool adjacent violators was introduced by  M. Ayer et al. in 1955.  A history and development of isotonic regression algorithms is in [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper) and list of available algorithms including their complexity is listed in [Stout, Fastest Isotonic Regression Algorithms](http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf).

An approach to parallelize the computation of PAV was presented in [Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression](http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf).

The implemented Pool adjacent violators algorithm is based on  [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false) (Chapter Isotonic regression problems, p. 86) and  [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), also nicely formulated in [Tibshirani,  Hoefling, Tibshirani, Nearly-Isotonic Regression](http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf). Implementation itself inspired by R implementations [Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism](http://cran.r-project.org/web/packages/fdrtool/index.html) and [R Development Core Team, stats, 2009](https://github.com/lgautier/R-3-0-branch-alt/blob/master/src/library/stats/R/isoreg.R). I ran tests with both these libraries and confirmed they yield the same results. More R implementations referenced in aforementioned [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators
Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper). The implementation is also inspired and cross checked with other implementations: [Ted Harding, 2007](https://stat.ethz.ch/pipermail/r-help/2007-March/127981.html), [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/_isotonic.pyx), [Andrew Tulloch, 2014, Julia](https://github.com/ajtulloch/Isotonic.jl/blob/master/src/pooled_pava.jl), [Andrew Tulloch, 2014, c++](https://gist.github.com/ajtulloch/9499872), described in [Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x](http://tullo.ch/articles/speeding-up-isotonic-regression/), [Fabian Pedregosa, 2012](https://gist.github.com/fabianp/3081831), [Sreangsu Acharyya. libpav](f744bc1b0f/src/pav.h?at=default) and [Gustav Larsson](https://gist.github.com/gustavla/9499068).

Author: martinzapletal <zapletal-martin@email.cz>
Author: Xiangrui Meng <meng@databricks.com>
Author: Martin Zapletal <zapletal-martin@email.cz>

Closes #3519 from zapletal-martin/SPARK-3278 and squashes the following commits:

5a54ea4 [Martin Zapletal] Merge pull request #2 from mengxr/isotonic-fix-java
37ba24e [Xiangrui Meng] fix java tests
e3c0e44 [martinzapletal] Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278
d8feb82 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
ded071c [Martin Zapletal] Merge pull request #1 from mengxr/SPARK-3278
4dfe136 [Xiangrui Meng] add cache back
0b35c15 [Xiangrui Meng] compress pools and update tests
35d044e [Xiangrui Meng] update paraPAVA
077606b [Xiangrui Meng] minor
05422a8 [Xiangrui Meng] add unit test for model construction
5925113 [Xiangrui Meng] Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278
80c6681 [Xiangrui Meng] update IRModel
3da56e5 [martinzapletal] SPARK-3278 fixed indentation error
75eac55 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
88eb4e2 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Isotonic parameter removed from algorithm, defined behaviour for multiple data points with the same feature value, added tests to verify it
e60a34f [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Styling and comment fixes.
d93c8f9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Change to IsotonicRegression api. Isotonic parameter now follows api of other mllib algorithms
1fff77d [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Java api changes, test refactoring, comments and citations, isotonic regression model validations, linear interpolation for predictions
12151e6 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
7aca4cc [martinzapletal] SPARK-3278 comment spelling
9ae9d53 [martinzapletal] SPARK-3278 changes after PR feedback https://github.com/apache/spark/pull/3519. Binary search used for isotonic regression model predictions
fad4bf9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519
ce0e30c [martinzapletal] SPARK-3278 readability refactoring
f90c8c7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
0d14bd3 [martinzapletal] SPARK-3278 changed Java api to match Scala api's (Double, Double, Double)
3c2954b [martinzapletal] SPARK-3278 Isotonic regression java api
45aa7e8 [martinzapletal] SPARK-3278 Isotonic regression java api
e9b3323 [martinzapletal] Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278
823d803 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
941fd1f [martinzapletal] SPARK-3278 Isotonic regression java api
a24e29f [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
deb0f17 [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
8cefd18 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint
cab5a46 [martinzapletal] SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments
b8b1620 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
34760d5 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
089bf86 [martinzapletal] Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replced by simple boolean
c06f88c [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
6046550 [martinzapletal] SPARK-3278 scalastyle errors resolved
8f5daf9 [martinzapletal] SPARK-3278 added comments and cleaned up api to consistently handle weights
629a1ce [martinzapletal] SPARK-3278 added isotonic regression for weighted data. Added tests for Java api
05d9048 [martinzapletal] SPARK-3278 isotonic regression refactoring and api changes
961aa05 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
3de71d0 [martinzapletal] SPARK-3278 added initial version of Isotonic regression algorithm including proposed API
2015-01-31 00:46:02 -08:00
Travis Galoppo 986977340d SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture
Decoupling the model and the algorithm

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:

9c1534c [Travis Galoppo] Fixed invokation instructions in comments
d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
2015-01-30 15:32:25 -08:00
sboeschhuawei f377431a57 [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
Add single pseudo-eigenvector PIC
Including documentations and updated pom.xml with the following codes:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala

Author: sboeschhuawei <stephen.boesch@huawei.com>
Author: Fan Jiang <fanjiang.sc@huawei.com>
Author: Jiang Fan <fjiang6@gmail.com>
Author: Stephen Boesch <stephen.boesch@huawei.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4254 from fjiang6/PIC and squashes the following commits:

4550850 [sboeschhuawei] Removed pic test data
f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
4b78aaf [Xiangrui Meng] refactor PIC
24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
121e4d5 [sboeschhuawei] Remove unused testing data files
1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
43ab10b [sboeschhuawei] Change last two println's to log4j logger
88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
be659e3 [sboeschhuawei] Added mllib specific log4j
90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
b29c0db [Fan Jiang] Update PIClustering.scala
ace9749 [Fan Jiang] Update PIClustering.scala
a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
f656c34 [sboeschhuawei] Added iris dataset
b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
9294263 [sboeschhuawei] Added visualization/plotting of input/output data
e5df2b8 [sboeschhuawei] First end to end working PIC
0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
32a90dc [sboeschhuawei] Update circles test data values
0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
2015-01-30 14:09:49 -08:00
Burak Yavuz 6ee8338b37 [SPARK-5486] Added validate method to BlockMatrix
The `validate` method will allow users to debug their `BlockMatrix`, if operations like `add` or `multiply` return unexpected results. It checks the following properties in a `BlockMatrix`:
- Are the dimensions of the `BlockMatrix` consistent with what the user entered: (`nRows`, `nCols`)
- Are the dimensions of each `MatrixBlock` consistent with what the user entered: (`rowsPerBlock`, `colsPerBlock`)
- Are there blocks with duplicate indices

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4279 from brkyvz/SPARK-5486 and squashes the following commits:

c152a73 [Burak Yavuz] addressed code review v2
598c583 [Burak Yavuz] merged master
b55ac5c [Burak Yavuz] addressed code review v1
25f083b [Burak Yavuz] simplify implementation
0aa519a [Burak Yavuz] [SPARK-5486] Added validate method to BlockMatrix
2015-01-30 13:59:10 -08:00
Xiangrui Meng 0a95085f09 [SPARK-5496][MLLIB] Allow both classification and Classification in Algo for trees.
to be backward compatible.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4287 from mengxr/SPARK-5496 and squashes the following commits:

a025c53 [Xiangrui Meng] Allow both classification and Classification in Algo for trees.
2015-01-30 10:08:07 -08:00
Joseph J.C. Tang 54d95758fc [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount
When the vocabSize\*vectorSize is larger than Int.MaxValue/8, we try to throw a RuntimeException. Because under this circumstance it would definitely throw an OOM when allocating memory to serialize the arrays syn0Global&syn1Global.   syn0Global&syn1Global are float arrays. Serializing them should need a byte array of more than 8 times of syn0Global's size.
Also if we catch an OOM even if vocabSize\*vectorSize is less than Int.MaxValue/8, we should give users hints to increase the minCount or decrease the vectorSize.

Author: Joseph J.C. Tang <jinntrance@gmail.com>

Closes #4247 from jinntrance/w2v-fix and squashes the following commits:

b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
2015-01-30 10:07:26 -08:00