Commit graph

736 commits

Author SHA1 Message Date
Xiangrui Meng d75496b189 [SPARK-3701][MLLIB] update python linalg api and small fixes
1. doc updates
2. simple checks on vector dimensions
3. use column major for matrices
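
As a minimal illustration of item 3, a column-major layout stores a matrix column by column in one flat array. The helper names below are hypothetical and are not part of the pyspark API.

```python
def column_major_index(i, j, num_rows):
    """Flat offset of entry (i, j) when columns are stored back to back."""
    return i + num_rows * j


def to_column_major(rows):
    """Flatten a list of equal-length rows into a column-major list."""
    num_rows, num_cols = len(rows), len(rows[0])
    return [rows[i][j] for j in range(num_cols) for i in range(num_rows)]


# 2 x 3 example matrix:
# [[1, 2, 3],
#  [4, 5, 6]]
values = to_column_major([[1, 2, 3], [4, 5, 6]])
```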

davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2548 from mengxr/mllib-py-clean and squashes the following commits:

6dce2df [Xiangrui Meng] address comments
116b5db [Xiangrui Meng] use np.dot instead of array.dot
75f2fcc [Xiangrui Meng] fix python style
fefce00 [Xiangrui Meng] better check of vector size with more tests
067ef71 [Xiangrui Meng] majored -> major
ef853f9 [Xiangrui Meng] update python linalg api and small fixes
2014-09-30 17:10:36 -07:00
Reza Zadeh 587a0cd7ed [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
# All-pairs similarity via DIMSUM
Compute all pairs of similar vectors using a brute-force approach, and also using the DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082
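
The map and reduce steps can be sketched in plain Python. This is not the PR's code. It assumes one common formulation of the sampling scheme, in which column j is kept with probability min(1, sqrt(gamma) / ||c_j||) and its entries are scaled by 1 / min(sqrt(gamma), ||c_j||); the figure above remains the authoritative statement. Rows are assumed to be sparse lists of `(column, value)` pairs sorted by column, and all function names are hypothetical.

```python
import math
import random


def column_magnitudes(rows, n):
    """L2 norm of each of the n columns; rows are lists of (col, value) pairs."""
    sums = [0.0] * n
    for row in rows:
        for j, v in row:
            sums[j] += v * v
    return [math.sqrt(s) for s in sums]


def dimsum_similarities(rows, n, gamma, seed=0):
    """Sampled estimate of cosine similarity for column pairs j < k."""
    rng = random.Random(seed)
    mags = column_magnitudes(rows, n)
    sg = math.sqrt(gamma)
    # p[j]: probability of keeping column j; q[j]: scaling applied to its entries.
    # Columns with zero magnitude get p = 1, q = 1 to avoid division by zero.
    p = [1.0 if m == 0 else min(1.0, sg / m) for m in mags]
    q = [1.0 if m == 0 else min(sg, m) for m in mags]

    sims = {}
    for row in rows:  # "map" over rows
        for a, (j, vj) in enumerate(row):
            if rng.random() >= p[j]:
                continue
            for k, vk in row[a + 1:]:
                if rng.random() < p[k]:
                    # "reduce" is a plain summation per column pair
                    sims[(j, k)] = sims.get((j, k), 0.0) + (vj / q[j]) * (vk / q[k])
    return sims


rows = [[(0, 1.0), (1, 2.0)], [(0, 3.0), (1, 4.0), (2, 5.0)]]
# With an unbounded gamma nothing is sampled away, so the result is exact.
exact = dimsum_similarities(rows, 3, float("inf"))
sampled = dimsum_similarities(rows, 3, 4.0, seed=1)
```

Under these assumptions each emitted term has expectation a_ij * a_ik / (||c_j|| * ||c_k||), so the per-pair sums estimate cosine similarities.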

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

Author: Reza Zadeh <rizlar@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:

404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
9fe17c0 [Xiangrui Meng] organize imports
2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
f2947e4 [Xiangrui Meng] some optimization
3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
0e4eda4 [Reza Zadeh] Use partition index for RNG
251bb9c [Reza Zadeh] Documentation
25e9d0d [Reza Zadeh] Line length for style
fb296f6 [Reza Zadeh] renamed to normL1 and normL2
3764983 [Reza Zadeh] Documentation
e9c6791 [Reza Zadeh] New interface and documentation
613f261 [Reza Zadeh] Column magnitude summary
75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
0f12ade [Reza Zadeh] Style changes
eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
41e8ece [Reza Zadeh] style changes
139c8e1 [Reza Zadeh] Syntax changes
029aa9c [Reza Zadeh] javadoc and new test
75edb25 [Reza Zadeh] All tests passing!
05e59b8 [Reza Zadeh] Add test
502ce52 [Reza Zadeh] new interface
654c4fb [Reza Zadeh] default methods
3726ca9 [Reza Zadeh] Remove MatrixAlgebra
6bebabb [Reza Zadeh] remove changes to MatrixSuite
5b8cd7d [Reza Zadeh] Initial files
2014-09-29 11:15:09 -07:00
Joseph K. Bradley 0dc2b6361d [SPARK-1545] [mllib] Add Random Forests
This PR adds RandomForest to MLlib.  The implementation is basic, and future performance optimizations will be important.  (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based on an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations).  Thank you for your inputs!

## Testing

Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets.

Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs).  Results below.

### Regression tests for DecisionTree

Summary: For training 1 tree, there are small regressions, especially from feature subsampling.

In the table below, each row is a single (random) dataset.  The 2 different sets of result columns are for 2 different RF implementations:
* (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once.  It does not include any code for feature subsampling.
* (feature subsets): This is from the current PR's code, after implementing feature subsampling.

These tests were to identify regressions in DecisionTree, so they train 1 tree with all of the features (i.e., no feature subsampling).

These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels).  Speedup values < 1 indicate slowdowns from the old DecisionTree implementation.

numInstances | numFeatures | runtime (sec) (numTrees) | speedup (numTrees) | runtime (sec) (feature subsets) | speedup (feature subsets)
---- | ---- | ---- | ---- | ---- | ----
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599
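
As a quick consistency check on the table, both speedup columns of a row should imply the same old DecisionTree runtime. The sketch below assumes speedup is defined as old runtime divided by new runtime; the description only states that values below 1 are slowdowns, so that definition is an assumption.

```python
def implied_baseline(runtime_sec, speedup):
    """Old runtime implied by a row, assuming speedup = old / new."""
    return runtime_sec * speedup


# First table row (20000 instances, 100 features).
base_num_trees = implied_baseline(4.051, 1.044433473)
base_feature_subsets = implied_baseline(4.478, 0.9448414471)
```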

### Tests for forests

I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees.

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed.  (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs.  (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel]
* Prediction:
 * For classification, most common label.  For regression, mean.
 * We could support other methods later.
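
The two aggregation rules can be sketched as follows. The function names are hypothetical, and tie-breaking between equally common labels is left arbitrary here.

```python
from collections import Counter


def predict_classification(tree_predictions):
    """Most common label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def predict_regression(tree_predictions):
    """Mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)
```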

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.

## DTStatsAggregator
* 2 types of functionality (with and without feature subsampling): These require different indexing methods.  (We could treat both as subsampling, but this is less efficient.)
  DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.

## Node
* Some vals changed to vars.
 * This is unfortunately a public API change (DeveloperApi).  This could be avoided by creating a LearningNode struct, but would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.
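
A rough sketch of the idea: each point carries one bootstrap count per tree, so a single pass over the data can serve every tree. The sampling here is plain sampling with replacement, which is an assumption; the description does not say how the PR draws its counts. The `BaggedPoint` tuple and `bag` function below are hypothetical stand-ins.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the wrapper described above: one datum plus one
# bootstrap count per tree.
BaggedPoint = namedtuple("BaggedPoint", ["datum", "subsample_weights"])


def bag(data, num_trees, seed=0):
    """Draw num_trees bootstrap samples and record per-point counts."""
    rng = random.Random(seed)
    counts = [[0] * num_trees for _ in data]
    for t in range(num_trees):
        # Sampling len(data) indices with replacement is one bootstrap sample.
        for _ in range(len(data)):
            counts[rng.randrange(len(data))][t] += 1
    return [BaggedPoint(d, c) for d, c in zip(data, counts)]


bagged = bag(["a", "b", "c", "d"], num_trees=3)
```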

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.

* RandomForest public API: What options should be easily supported by the train* methods?  Should ALL options be in the Java-friendly constructors?  Should there be a constructor taking Strategy?

* Feature subsampling options: What options should be supported?  scikit-learn supports the same options, except for "onethird."  One option would be to allow users to specify fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].

* Splits and bins are computed before bootstrapping, so all trees use the same discretization.

* One queue, instead of one queue per tree.
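
The feature subsampling proposal above (named options plus a fallback to fractions) could look roughly like this. The option names "all" and "log2", the rounding up, and the floor of 1 feature are assumptions made for illustration; only "sqrt" and "onethird" are named explicitly in this description.

```python
import math


def num_features_per_node(strategy, num_features):
    """Map a featureSubsetStrategy string to a feature count per node."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return max(1, int(math.ceil(math.sqrt(num_features))))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return max(1, int(math.ceil(num_features / 3.0)))
    # Proposed extension: treat any other value as a fraction in [0, 1].
    fraction = float(strategy)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]: %r" % strategy)
    return max(1, int(math.ceil(fraction * num_features)))
```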

CC: mengxr manishamde codedeft chouqin  Please let me know if you have suggestions---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2435 from jkbradley/rfs-new and squashes the following commits:

c694174 [Joseph K. Bradley] Fixed typo
cc59d78 [Joseph K. Bradley] fixed imports
e25909f [Joseph K. Bradley] Simplified node group maps.  Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits
fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression.  Updated docs with references.
ef7c293 [Joseph K. Bradley] Updates based on code review.  Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite
593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features).  Now it checks >= 1.
a1a08df [Joseph K. Bradley] Removed old comments
866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds.
ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option
bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues.
d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest).  Fixed bugs so that RandomForest now runs.
746d43c [Joseph K. Bradley] Implemented feature subsampling.  Tested DecisionTree but not RandomForest.
6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new.  Added RandomForestModel.toString
b7ae594 [Joseph K. Bradley] Updated docs.  Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes.
121c74e [Joseph K. Bradley] Basic random forests are implemented.  Random features per node not yet implemented.  Test suite not implemented.
325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new
4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet
eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-28 21:44:50 -07:00
RJ Nowling ec9df6a765 [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.

This is implemented using a minimumOccurence parameter (default 0).  When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0.  As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
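
A small sketch of the filtering rule, outside of Spark. The smoothed formula log((m + 1) / (df + 1)) is an assumption about how the IDF is computed, and `min_doc_freq` is a hypothetical Python-style name for the parameter.

```python
import math


def idf_with_min_doc_freq(docs, min_doc_freq=0):
    """IDF per term; terms seen in fewer than min_doc_freq documents get 0."""
    num_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((num_docs + 1.0) / (df + 1.0)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }


docs = [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
unfiltered = idf_with_min_doc_freq(docs)
filtered = idf_with_min_doc_freq(docs, min_doc_freq=2)
```

With the filter at 2, the terms "c" and "d" (each in one document) get an IDF of 0, so their TF-IDF values are 0 as well.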

This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatible with the original IDF API.
* Set the IDFs to 0 for terms whose DFs are less than minimumOccurence
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
* Update the MLlib Feature Extraction programming guide to describe the new feature

Author: RJ Nowling <rnowling@gmail.com>

Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:

0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
2014-09-26 09:58:47 -07:00
Aaron Staple ff637c9380 [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm’s current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.

I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, the warning would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached, if the warning were in GradientDescent or another optimizer.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent will be knowledgeable enough to cache their data without needing a log warning, so lack of a warning in the optimizers may be ok.
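
The placement argument can be illustrated without Spark. In this sketch `Dataset` is a hypothetical stand-in for an RDD, and the warning text is made up; the point is only that the check runs on the caller's input before any internal mapping produces a new, uncached dataset.

```python
import warnings


class Dataset(object):
    """Hypothetical stand-in for an RDD that knows whether it is cached."""

    def __init__(self, records, cached=False):
        self.records = records
        self.cached = cached

    def map(self, fn):
        # Like an RDD transformation, the result is a new, uncached dataset.
        return Dataset([fn(r) for r in self.records], cached=False)


def run_iterative_algorithm(data):
    # Check the caller's input here, before any internal mapping, so the
    # warning reflects what the user actually passed in.
    if not data.cached:
        warnings.warn("The input data is not cached, which may hurt performance.")
    prepared = data.map(lambda r: (1.0, r))  # internal, intentionally uncached
    return len(prepared.records)
```

Had the check been placed after `data.map(...)`, it would fire on every run, cached input or not.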

Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2347 from staple/SPARK-1484 and squashes the following commits:

bd49701 [Aaron Staple] Address review comments.
ab2d4a4 [Aaron Staple] Disable warnings on python code path.
a7a0f99 [Aaron Staple] Change code comments per review comments.
7cca1dc [Aaron Staple] Change warning message text.
c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
3b6c511 [Aaron Staple] Minor doc example fixes.
2014-09-25 16:11:00 -07:00
Davies Liu fce5e251d6 [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib
Currently, we serialize the data between the JVM and Python manually, case by case. This cannot scale to support the many APIs in MLlib.

This patch addresses the problem by serializing the data using the pickle protocol, with the Pyrolite library handling serialization and deserialization in the JVM. The pickle protocol can easily be extended to support custom classes.

All the modules are refactored to use this protocol.

Known issues: There will be some performance regression (in both CPU and memory, since the serialized data is larger).
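
On the Python side, extending pickle to a custom class can be as small as defining `__reduce__`. The class below is a hypothetical stand-in, not the MLlib class of the same name, and the sketch covers only the Python half (the JVM half via Pyrolite is not shown).

```python
import pickle


class LabeledPoint(object):
    """Hypothetical stand-in for a user-facing MLlib data class."""

    def __init__(self, label, features):
        self.label = float(label)
        self.features = list(features)

    def __reduce__(self):
        # Tell pickle how to rebuild the object: call the class with these args.
        return (LabeledPoint, (self.label, self.features))


point = LabeledPoint(1.0, [0.5, 2.0])
restored = pickle.loads(pickle.dumps(point, protocol=2))
```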

Author: Davies Liu <davies.liu@gmail.com>

Closes #2378 from davies/pickle_mllib and squashes the following commits:

dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
810f97f [Davies Liu] fix equal of matrix
032cd62 [Davies Liu] add more type check and conversion for user_product
bd738ab [Davies Liu] address comments
e431377 [Davies Liu] fix cache of rdd, refactor
19d0967 [Davies Liu] refactor Picklers
2511e76 [Davies Liu] cleanup
1fccf1a [Davies Liu] address comments
a2cc855 [Davies Liu] fix tests
9ceff73 [Davies Liu] test size of serialized Rating
44e0551 [Davies Liu] fix cache
a379a81 [Davies Liu] fix pickle array in python2.7
df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
154d141 [Davies Liu] fix autobatchedpickler
44736d7 [Davies Liu] speed up pickling array in Python 2.7
e1d1bfc [Davies Liu] refactor
708dc02 [Davies Liu] fix tests
9dcfb63 [Davies Liu] fix style
88034f0 [Davies Liu] rafactor, address comments
46a501e [Davies Liu] choose batch size automatically
df19464 [Davies Liu] memorize the module and class name during pickleing
f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
722dd96 [Davies Liu] cleanup _common.py
0ee1525 [Davies Liu] remove outdated tests
b02e34f [Davies Liu] remove _common.py
84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
4d7963e [Davies Liu] remove muanlly serialization
6d26b03 [Davies Liu] fix tests
c383544 [Davies Liu] classification
f2a0856 [Davies Liu] mllib/regression
d9f691f [Davies Liu] mllib/util
cccb8b1 [Davies Liu] mllib/tree
8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
aa2287e [Davies Liu] random
f1544c4 [Davies Liu] refactor clustering
52d1350 [Davies Liu] use new protocol in mllib/stat
b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-19 15:01:11 -07:00
Burak e76ef5cb8e [SPARK-3418] Sparse Matrix support (CCS) and additional native BLAS operations added
Added local `SparseMatrix` support in Compressed Column Storage (CCS) format, along with Level-2 and Level-3 BLAS operations (dgemv and dgemm, respectively).

BLAS doesn't support sparse matrix operations, so implementations of `SparseMatrix`-`DenseMatrix` and `SparseMatrix`-`DenseVector` multiplication have been added. I will post performance comparisons in the comments momentarily.
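
For readers unfamiliar with CCS, here is a minimal sketch of the storage layout and a sparse-matrix times dense-vector product. The class and field names are hypothetical and do not mirror the `SparseMatrix` API added in this PR.

```python
class SparseMatrixCCS(object):
    """Sketch of Compressed Column Storage for a num_rows x num_cols matrix.

    col_ptrs has num_cols + 1 entries; the nonzeros of column j live at
    positions col_ptrs[j] .. col_ptrs[j + 1] - 1 of row_indices and values.
    """

    def __init__(self, num_rows, num_cols, col_ptrs, row_indices, values):
        assert len(col_ptrs) == num_cols + 1
        assert len(row_indices) == len(values) == col_ptrs[-1]
        self.num_rows, self.num_cols = num_rows, num_cols
        self.col_ptrs, self.row_indices, self.values = col_ptrs, row_indices, values

    def multiply_vector(self, x):
        """Return y = A * x for a dense vector x of length num_cols."""
        y = [0.0] * self.num_rows
        for j in range(self.num_cols):
            for pos in range(self.col_ptrs[j], self.col_ptrs[j + 1]):
                y[self.row_indices[pos]] += self.values[pos] * x[j]
        return y


# 3 x 3 example:
# [[1, 0, 4],
#  [0, 3, 5],
#  [2, 0, 6]]
A = SparseMatrixCCS(3, 3, [0, 2, 3, 6], [0, 2, 1, 0, 1, 2],
                    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = A.multiply_vector([1.0, 2.0, 3.0])
```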

Author: Burak <brkyvz@gmail.com>

Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:

88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
2014-09-18 22:18:51 -07:00
qiping.lqp fdb302f49c [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)

Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes

CC: mengxr

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:

61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-15 17:43:26 -07:00
Reza Zadeh 983d6a9c48 [MLlib] Update SVD documentation in IndexedRowMatrix
Updating this to reflect the newest SVD via ARPACK

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2389 from rezazadeh/irmdocs and squashes the following commits:

7fa1313 [Reza Zadeh] Update svd docs
715da25 [Reza Zadeh] Updated computeSVD documentation IndexedRowMatrix
2014-09-15 17:41:15 -07:00
Christoph Sawade 3b93128139 [SPARK-3396][MLLIB] Use SquaredL2Updater in LogisticRegressionWithSGD
SimpleUpdater ignores the regularizer, which leads to an unregularized
LogReg. To enable the common L2 regularizer (and the corresponding
regularization parameter) for logistic regression the SquaredL2Updater
has to be used in SGD (see, e.g., [SVMWithSGD])
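
The difference between the two updaters can be sketched as follows. The function names are hypothetical, and the 1/sqrt(iteration) step decay is an assumption made for illustration.

```python
import math


def simple_update(weights, gradient, step_size, iteration, reg_param):
    """Plain SGD step; reg_param is accepted but has no effect."""
    step = step_size / math.sqrt(iteration)
    return [w - step * g for w, g in zip(weights, gradient)]


def squared_l2_update(weights, gradient, step_size, iteration, reg_param):
    """SGD step for loss + (reg_param / 2) * ||w||^2: shrink, then step."""
    step = step_size / math.sqrt(iteration)
    return [w * (1.0 - step * reg_param) - step * g
            for w, g in zip(weights, gradient)]
```

With a zero gradient, the simple update leaves the weights unchanged for any `reg_param`, while the L2 update shrinks them toward zero.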

Author: Christoph Sawade <christoph@sawade.me>

Closes #2398 from BigCrunsh/fix-regparam-logreg and squashes the following commits:

0820c04 [Christoph Sawade] Use SquaredL2Updater in LogisticRegressionWithSGD
2014-09-15 17:39:31 -07:00
Joseph K. Bradley b8634df1f1 [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix.
This PR includes some code simplifications and re-organization which will be helpful for implementing random forests.  The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.

Also added 2 bug fixes:
* maxMemoryUsage calculation
* over-allocation of space for bins in DTStatsAggregator for unordered features.

Relation to RFs:
* Since RFs will be deeper and are therefore more likely to be sparse (not full trees), it could be a cost savings to avoid pre-allocating a full tree.
* The associated re-organization also reduces bookkeeping, which will make RFs easier to implement.
* The return code doneTraining may be generalized to include cases such as nodes ready for local training.

Details:

No longer pre-allocate parentImpurities array in main train() method.
* parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
* These were not really needed.  They were used in calculateGainForSplit(), but they can be calculated anyways using parentNodeAgg.

No longer using Node.build since tree structure is constructed on-the-fly.
* Did not eliminate since it is public (Developer) API.  Marked as deprecated.

Eliminated pre-allocated nodes array in main train() method.
* Nodes are constructed and added to the tree structure as needed during training.
* Moved tree construction from main train() method into findBestSplitsPerGroup() since there is no need to keep the (split, gain) array for an entire level of nodes.  Only one element of that array is needed at a time, so we do not need the array.

findBestSplits() now returns 2 items:
* rootNode (newly created root node on first iteration, same root node on later iterations)
* doneTraining (indicating if all nodes at that level were leafs)

Updated DecisionTreeSuite.  Notes:
* Improved test "Second level node building with vs. without groups"
 * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
* Related update: Added Node.deepCopy (private[tree]), used for test suite

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2341 from jkbradley/dt-spark-3160 and squashes the following commits:

07dd1ee [Joseph K. Bradley] Fixed overflow bug with computing maxMemoryUsage in DecisionTree.  Also fixed bug with over-allocating space in DTStatsAggregator for unordered features.
debe072 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
2014-09-12 01:37:59 -07:00
qiping.lqp 79cdb9b64a [SPARK-2207][SPARK-3272][MLLib]Add minimum information gain and minimum instances per node as training parameters for decision tree.
These two parameters can act as early stopping rules for pre-pruning. When a split would cause the left or right child to have fewer than `minInstancesPerNode` instances, or has less information gain than `minInfoGain`, the current node will not be split by that split.

When no possible split satisfies the requirements, there are no useful information gain stats, but we still need to calculate the predicted value for the current node. So I separated the calculation of the prediction from the calculation of information gain, which can also save computation when the number of possible splits is large. Please see [SPARK-3272](https://issues.apache.org/jira/browse/SPARK-3272) for more details.
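
The two rules can be sketched as a filter over candidate splits. The function names and the `(gain, left_count, right_count)` tuple layout are hypothetical, and the default of 0.0 for `min_info_gain` is an assumption.

```python
def is_valid_split(left_count, right_count, info_gain,
                   min_instances_per_node=1, min_info_gain=0.0):
    """True if a candidate split passes both pre-pruning rules."""
    if left_count < min_instances_per_node:
        return False
    if right_count < min_instances_per_node:
        return False
    return info_gain >= min_info_gain


def choose_split(candidates, min_instances_per_node=1, min_info_gain=0.0):
    """Best valid (gain, left_count, right_count) candidate, or None for a leaf."""
    valid = [c for c in candidates
             if is_valid_split(c[1], c[2], c[0],
                               min_instances_per_node, min_info_gain)]
    return max(valid) if valid else None
```

When `choose_split` returns `None`, the node becomes a leaf, which is why its prediction has to be computable without any split statistics.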

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2332 from chouqin/dt-preprune and squashes the following commits:

f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-10 15:37:10 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
Joseph K. Bradley 711356b422 [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements
Summary:
1. Variable numBins for each feature [SPARK-3043]
2. Reduced data reshaping in aggregation [SPARK-3043]
3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
4. Changed nodes to use 1-indexing [SPARK-3086]
5. Small clean-ups

Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).

Speedups: This update helps most when many features use few bins but a few features use many bins.  Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
* Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
* Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec

Details:

(1) Variable numBins for each feature [SPARK-3043]

DecisionTreeMetadata now computes a variable numBins for each feature.  It also tracks numSplits.

(2) Reduced data reshaping in aggregation [SPARK-3043]

Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
* Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
* Design note: I originally tried creating Impurity classes which stored data and storing the aggregates in an Array[Array[Array[Impurity]]].  However, this led to significant slowdowns, perhaps because of overhead in creating so many objects.

The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
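
As a rough illustration of a flat statistics array with in-place cumulative sums, here is a plain-Python sketch. `FlatStatsAggregator` and its layout (per-feature offsets, then `bin_index * stats_size`) are hypothetical stand-ins, not the actual DTStatsAggregator.

```python
class FlatStatsAggregator:
    """Hypothetical flat array of per-(feature, bin) sufficient statistics."""

    def __init__(self, num_bins_per_feature, stats_size):
        self.stats_size = stats_size
        self.num_bins = list(num_bins_per_feature)
        # Each feature may have a different number of bins, so store offsets.
        self.offsets = []
        total = 0
        for num_bins in self.num_bins:
            self.offsets.append(total)
            total += num_bins * stats_size
        self.stats = [0.0] * total

    def index(self, feature, bin_index):
        """Start position of the stats block for (feature, bin)."""
        return self.offsets[feature] + bin_index * self.stats_size

    def update(self, feature, bin_index, stat_index, value=1.0):
        self.stats[self.index(feature, bin_index) + stat_index] += value

    def cumulate(self, feature):
        """Turn per-bin stats into running totals without allocating a copy."""
        for b in range(1, self.num_bins[feature]):
            cur = self.index(feature, b)
            prev = self.index(feature, b - 1)
            for s in range(self.stats_size):
                self.stats[cur + s] += self.stats[prev + s]
```

After `cumulate`, the block for bin `b` holds the statistics of everything that falls in bins `0..b`, which is what a "left child" of a split at `b` needs.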

Updated the layout of aggregation functions.  The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
For this update, the following functions were refactored:
* updateBinForOrderedFeature
* updateBinForUnorderedFeature
* binaryOrNotCategoricalBinSeqOp
* multiclassWithCategoricalBinSeqOp
* regressionBinSeqOp
The above 5 functions were replaced with:
* orderedBinSeqOp
* someUnorderedBinSeqOp

Other changes:
* calculateGainForSplit now treats all feature types the same way.
* Eliminated extractLeftRightNodeAggregates.

(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]

Updated binsToBestSplit():
* This now computes cumulative sums of stats for ordered features.
* For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins.)
* Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].
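
The following plain-Python sketch shows one way to choose a category ordering from aggregated statistics and then scan prefix sums. Sorting by mean label is an assumption for illustration (a common choice for regression and binary classification); the function names are hypothetical.

```python
def order_categories(stats):
    """stats: {category: (count, label_sum)}. Sort categories by mean label."""
    def centroid(category):
        count, label_sum = stats[category]
        return label_sum / count if count else 0.0
    return sorted(stats, key=centroid)


def candidate_splits(stats):
    """Return (left_categories, left_count, left_label_sum) for each prefix.

    Only the k - 1 proper prefixes of the ordering are candidate splits.
    """
    ordered = order_categories(stats)
    splits = []
    count = 0
    label_sum = 0.0
    for position, category in enumerate(ordered[:-1]):
        count += stats[category][0]
        label_sum += stats[category][1]
        splits.append((ordered[:position + 1], count, label_sum))
    return splits
```

Because the ordering is derived from the aggregates at the node, it can differ from node to node rather than being fixed ahead of time.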

Side effects:
* In findSplitsBins: A sample of the data is only taken for data with continuous features.  It is not needed for data with only categorical features.
* In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
* TreePoint binning is simpler for categorical features.

(4) Changed nodes to use 1-indexing [SPARK-3086]

Nodes used to be indexed from 0.  Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).
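
With 1-based indexing, node relationships reduce to the usual binary-heap arithmetic. The helpers below are a sketch with hypothetical names, assuming the root has index 1:

```python
def left_child(index):
    return index << 1


def right_child(index):
    return (index << 1) + 1


def parent(index):
    return index >> 1


def level(index):
    """Depth of the node, with the root (index 1) at level 0."""
    return index.bit_length() - 1


def is_left_child(index):
    return index > 1 and index % 2 == 0
```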

(5) Small clean-ups

Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.

CC: mengxr  manishamde  Please let me know if you have thoughts on this—thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2125 from jkbradley/dt-opt3alt and squashes the following commits:

42c192a [Joseph K. Bradley] Merge branch 'rfs' into dt-opt3alt
d3cc46b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
00e4404 [Joseph K. Bradley] optimization for TreePoint construction (pre-computing featureArity and isUnordered as arrays)
425716c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs
a2acea5 [Joseph K. Bradley] Small optimizations based on profiling
aa4e4df [Joseph K. Bradley] Updated DTStatsAggregator with bug fix (nodeString should not be multiplied by statsSize)
4651154 [Joseph K. Bradley] Changed numBins semantics for unordered features. * Before: numBins = numSplits = (1 << k - 1) - 1 * Now: numBins = 2 * numSplits = 2 * [(1 << k - 1) - 1] * This also involved changing the semantics of: ** DecisionTreeMetadata.numUnorderedBins()
1e3b1c7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
1485fcc [Joseph K. Bradley] Made some DecisionTree methods private.
92f934f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
e676da1 [Joseph K. Bradley] Updated documentation for DecisionTree
37ca845 [Joseph K. Bradley] Fixed problem with how DecisionTree handles ordered categorical features.
105f8ab [Joseph K. Bradley] Removed commented-out getEmptyBinAggregates from DecisionTree
062c31d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
6d32ccd [Joseph K. Bradley] In DecisionTree.binsToBestSplit, changed loops to iterators to shorten code.
807cd00 [Joseph K. Bradley] Finished DTStatsAggregator, a wrapper around the aggregate statistics for easy but hopefully efficient indexing.  Modified old ImpurityAggregator classes and renamed them ImpurityCalculator; added ImpurityAggregator classes which work with DTStatsAggregator but do not store data.  Unit tests all succeed.
f2166fd [Joseph K. Bradley] still working on DTStatsAggregator
92f7118 [Joseph K. Bradley] Added partly written DTStatsAggregator
fd8df30 [Joseph K. Bradley] Moved some aggregation helpers outside of findBestSplitsPerGroup
d7c53ee [Joseph K. Bradley] Added more doc for ImpurityAggregator
a40f8f1 [Joseph K. Bradley] Changed nodes to be indexed from 1.  Tests work.
95cad7c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
5f94342 [Joseph K. Bradley] Added treeAggregate since not yet merged from master.  Moved node indexing functions to Node.
61c4509 [Joseph K. Bradley] Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned up DT Suite some.
3ba7166 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
b314659 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
9c83363 [Joseph K. Bradley] partial merge but not done yet
45f7ea7 [Joseph K. Bradley] partial merge, not yet done
5fce635 [Joseph K. Bradley] Merge branch 'dt-opt2' into dt-opt3
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
dd4d3aa [Joseph K. Bradley] Mid-process in bug fix: bug for binary classification with categorical features * Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits. * (Temp) Fix: Treat low-arity categorical features as unordered for binary classification. * Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features. * Status: The bug may be fixed, but more testing needs to be done.
438a660 [Joseph K. Bradley] removed subsampling for mnist8m from DT
86e217f [Joseph K. Bradley] added cache to DT input
e3c84cc [Joseph K. Bradley] Added stuff for mnist8m to DT Runner
51ef781 [Joseph K. Bradley] Fixed bug introduced by last commit: Variance impurity calculation was incorrect since counts were swapped accidentally
fd65372 [Joseph K. Bradley] Major changes: * Created ImpurityAggregator classes, rather than old aggregates. * Feature split/bin semantics are based on ordered vs. unordered ** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features. * numBins can differ for each feature
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-09-08 09:47:13 -07:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
Kousuke Saruta 1bed0a3869 [SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to multi-byte character contained in Gradient.scala
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2248 from sarutak/SPARK-3372 and squashes the following commits:

73a28b8 [Kousuke Saruta] Replaced UTF-8 hyphen with ascii hyphen
2014-09-03 20:47:00 -07:00
Reza Zadeh 0f16b23cd1 [MLlib] Squash bug in IndexedRowMatrix
Kill this bug fast before it does damage.

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2224 from rezazadeh/indexrmbug and squashes the following commits:

53386d6 [Reza Zadeh] Squash bug in IndexedRowMatrix
2014-09-02 09:48:05 -07:00
Xiangrui Meng 220f413686 [SPARK-2495][MLLIB] make KMeans constructor public
to re-construct k-means models freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #2112 from mengxr/public-constructors and squashes the following commits:

18d53a9 [Xiangrui Meng] make KMeans constructor public
2014-08-25 12:30:02 -07:00
Xiangrui Meng 0a984aa155 [SPARK-3142][MLLIB] output shuffle data directly in Word2Vec
Sorry I didn't realize this in #2043. Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2049 from mengxr/more-w2v and squashes the following commits:

050b1c5 [Xiangrui Meng] output shuffle data directly
2014-08-19 22:16:22 -07:00
Xiangrui Meng fce5c0fb63 [HOTFIX][Streaming][MLlib] use temp folder for checkpoint
or Jenkins will complain about no Apache header in checkpoint files. tdas rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #2046 from mengxr/tmp-checkpoint and squashes the following commits:

0d3ec73 [Xiangrui Meng] remove ssc.stop
9797843 [Xiangrui Meng] change checkpointDir to lazy val
89964ab [Xiangrui Meng] use temp folder for checkpoint
2014-08-19 22:05:29 -07:00
Xiangrui Meng 068b6fe6a1 [SPARK-3130][MLLIB] detect negative values in naive Bayes
because NB treats feature values as term frequencies. jkbradley
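
A check of this kind can be sketched as follows. The function name and error message are hypothetical; the point is only that a negative "term frequency" is rejected before training.

```python
def require_nonnegative(features):
    """Raise if any feature value is negative, since counts cannot be."""
    for position, value in enumerate(features):
        if value < 0.0:
            raise ValueError(
                "Naive Bayes requires nonnegative feature values, "
                "but found %r at index %d." % (value, position)
            )
```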

Author: Xiangrui Meng <meng@databricks.com>

Closes #2038 from mengxr/nb-neg and squashes the following commits:

52c37c3 [Xiangrui Meng] address comments
65f892d [Xiangrui Meng] detect negative values in nb
2014-08-19 21:01:23 -07:00
Xiangrui Meng 1870dbaa55 [MLLIB] minor update to word2vec
very minor update Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2043 from mengxr/minor-w2v and squashes the following commits:

be649fd [Xiangrui Meng] remove map because we only need append
eccefcc [Xiangrui Meng] minor updates to word2vec
2014-08-19 17:41:37 -07:00
Xiangrui Meng 825d4fe47b [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs
Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #2041 from mengxr/stat-doc and squashes the following commits:

fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
2014-08-19 16:06:48 -07:00
freeman 31f0b071ef [SPARK-3128][MLLIB] Use streaming test suite for StreamingLR
Refactored tests for streaming linear regression to use existing  streaming test utilities. Summary of changes:
- Made ``mllib`` depend on tests from ``streaming``
- Rewrote accuracy and convergence tests to use ``setupStreams`` and ``runStreams``
- Added new test for the accuracy of predictions generated by ``predictOnValue``

These tests should run faster, be easier to extend/maintain, and provide a reference for new tests.

mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #2037 from freeman-lab/streamingLR-predict-tests and squashes the following commits:

e851ca7 [freeman] Fixed long lines
50eb0bf [freeman] Refactored tests to use streaming test tools
32c43c2 [freeman] Added test for prediction
2014-08-19 13:28:57 -07:00
Xiangrui Meng 217b5e915e [SPARK-3108][MLLIB] add predictOnValues to StreamingLR and fix predictOn
It is useful in streaming to allow users to carry extra data with the prediction, for monitoring the prediction error for example. freeman-lab
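
The idea can be sketched without Spark: predictions are computed on the values of key-value pairs and the keys pass through untouched, in the spirit of `mapValues`. The helper names below are hypothetical, and a plain linear model is assumed.

```python
def predict(weights, intercept, features):
    """Linear prediction: intercept + weights . features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def predict_on_values(weights, intercept, keyed_features):
    """Yield (key, prediction) pairs, leaving each key unchanged."""
    for key, features in keyed_features:
        yield key, predict(weights, intercept, features)


# Usage: carry the true label as the key to monitor the prediction error.
batch = [(3.0, [1.0, 2.0]), (0.5, [0.0, 1.0])]
errors = [abs(label - prediction)
          for label, prediction in predict_on_values([1.0, 1.0], 0.0, batch)]
```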

Author: Xiangrui Meng <meng@databricks.com>

Closes #2023 from mengxr/predict-on-values and squashes the following commits:

cac47b8 [Xiangrui Meng] add classtag
2821b3b [Xiangrui Meng] use mapValues
0925efa [Xiangrui Meng] add predictOnValues to StreamingLR and fix predictOn
2014-08-18 18:20:54 -07:00
Joseph K. Bradley c8b16ca0d8 [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 18:01:39 -07:00
Joseph K. Bradley 115eeb30dd [mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:

b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-18 14:40:05 -07:00
Liquan Pei 9306b8c6c8 [MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
mengxr
Remove  transform(dataset: RDD[String]) from public API.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:

17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
2014-08-18 01:15:45 -07:00
Liquan Pei 3c8fa50590 [SPARK-3097][MLlib] Word2Vec performance improvement
mengxr Please review the code. Adding weights in reduceByKey soon.

Only output model entries for words that appeared in the partition before merging, and use reduceByKey to combine the models. In general, this implementation is about 30s faster than the implementation using a big array.
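
A Spark-free sketch of the pattern: each partition emits vectors only for the words it saw, and the per-word vectors are then merged by key. The helper names are hypothetical, and combining by plain summation is an assumption (the message above notes that weighting was still to come).

```python
def partition_updates(local_vectors, words_seen):
    """Emit (word_index, vector) pairs only for words seen in this partition."""
    return [(index, local_vectors[index]) for index in sorted(words_seen)]


def add_vectors(a, b):
    return [x + y for x, y in zip(a, b)]


def reduce_by_key(pairs, combine):
    """Local stand-in for a reduce-by-key over (key, value) pairs."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged
```

Compared with shipping one dense array covering the whole vocabulary from every partition, the amount of shuffled data scales with the words each partition actually touched.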

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following commits:

d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models
2014-08-17 23:29:44 -07:00
Xiangrui Meng c77f40668f [SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for number of distinct values
There is a bug determining the column index. dorx

Author: Xiangrui Meng <meng@databricks.com>

Closes #1997 from mengxr/chisq-index and squashes the following commits:

8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of distinct values
2014-08-17 20:53:18 -07:00
Joseph K. Bradley 73ab7f141c [SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
DecisionTree needs to match each example to a node at each iteration.  It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node.

Fix: Filter top-down using the partly built tree itself.
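
A minimal sketch of top-down filtering in plain Python: each example is routed from the root through the partly built tree until it reaches a node that has not been split yet. The `Node` class and `predict_node_index` function here are illustrative stand-ins that assume continuous features with threshold splits.

```python
class Node:
    """Node of a partly built tree; unsplit nodes have no children."""

    def __init__(self, index, feature=None, threshold=None, left=None, right=None):
        self.index = index
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right


def predict_node_index(root, features):
    """Walk down from the root and return the index of the node reached."""
    node = root
    while node.left is not None:
        if features[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.index


# Usage: the root and node 3 are split; nodes 2, 6 and 7 are the current frontier.
tree = Node(1, feature=0, threshold=0.5,
            left=Node(2),
            right=Node(3, feature=1, threshold=2.0, left=Node(6), right=Node(7)))
```

The cost per example is one root-to-frontier walk, instead of one upward trace for every node at the current level.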

Major changes:
* Eliminated Filter class, findBinsForLevel() method.
* Set up node parent links in main loop over levels in train().
* Added predictNodeIndex() for filtering top-down.
* Added DTMetadata class

Other changes:
* Pre-compute set of unorderedFeatures.

Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]:
* The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy).

I've done initial tests indicating that this speeds things up, but am only now running large-scale ones.

CC: mengxr manishamde chouqin  Any comments are welcome---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:

a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-16 23:53:14 -07:00
Xiangrui Meng fbad72288d [SPARK-3077][MLLIB] fix some chisq-test
- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test should have a null hypothesis
- correct null hypothesis statement for independence test
- p-value: 0.01 -> 0.1

Author: Xiangrui Meng <meng@databricks.com>

Closes #1982 from mengxr/fix-chisq and squashes the following commits:

5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
bc74ea1 [Xiangrui Meng] update chisq-test
2014-08-16 21:16:27 -07:00
Xiangrui Meng ac6411c6e7 [SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
`RandomRDDGenerators` suggests a factory for `RandomRDDGenerator`. However, its methods return RDDs, not RDDGenerators. So a more proper (and shorter) name would be `RandomRDDs`.

dorx brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #1979 from mengxr/randomrdds and squashes the following commits:

b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
2014-08-16 15:14:43 -07:00
Xiangrui Meng 7e70708a99 [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.

`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.

freeman-lab tdas

Author: Xiangrui Meng <meng@databricks.com>

Closes #1952 from mengxr/labelparser and squashes the following commits:

c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
2014-08-16 15:13:34 -07:00
Xiangrui Meng 2e069ca656 [SPARK-3001][MLLIB] Improve Spearman's correlation
The current implementation requires sorting individual columns, which could be done with a global sort.

result on a 32-node cluster:

m | n | prev | this
---|---|-------|-----
1000000 | 50 | 55s | 9s
10000000 | 50 | 97s | 76s
1000000 | 100  | 119s | 15s
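
To illustrate the single-sort idea, the sketch below flattens the matrix into `(column, value, row)` entries, sorts once, and reads the per-column ranks off the sorted positions. This is a local, plain-Python approximation under my own naming, not the distributed implementation.

```python
import math


def ranks_by_global_sort(rows):
    """Per-column ranks (1-based, ties averaged) from one sort of all entries."""
    m = len(rows)
    entries = sorted((j, v, i) for i, row in enumerate(rows)
                     for j, v in enumerate(row))
    ranks = [[0.0] * len(rows[0]) for _ in rows]
    start = 0
    while start < len(entries):
        col, value, _ = entries[start]
        end = start
        while (end < len(entries) and entries[end][0] == col
               and entries[end][1] == value):
            end += 1
        # Column `col` occupies sorted positions col*m .. (col+1)*m - 1.
        average = ((start - col * m + 1) + (end - col * m)) / 2.0
        for _, _, i in entries[start:end]:
            ranks[i][col] = average
        start = end
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(rows, a, b):
    """Spearman correlation of columns a and b: Pearson on their ranks."""
    ranks = ranks_by_global_sort(rows)
    return pearson([r[a] for r in ranks], [r[b] for r in ranks])
```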

Author: Xiangrui Meng <meng@databricks.com>

Closes #1917 from mengxr/spearman and squashes the following commits:

4d5d262 [Xiangrui Meng] remove unused import
85c48de [Xiangrui Meng] minor updates
a048d0c [Xiangrui Meng] remove cache and set a limit to cachedIds
b98bb18 [Xiangrui Meng] add comments
0846e07 [Xiangrui Meng] first version
2014-08-15 21:07:55 -07:00
Xiangrui Meng 5d25c0b74f [SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others
Should ask users to set parameters through the optimizer. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1973 from mengxr/lr-lbfgs and squashes the following commits:

e3efbb1 [Xiangrui Meng] fix tests
21b3579 [Xiangrui Meng] fix method name
641eea4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lr-lbfgs
456ab7c [Xiangrui Meng] update LRWithLBFGS
2014-08-15 21:04:29 -07:00
Joseph K. Bradley c7032290a3 [SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.
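
A small sketch of binning once up front, assuming continuous features with sorted split thresholds. The tuple representation and the names `find_bin` and `to_tree_point` are hypothetical; the real TreePoint class is private to MLlib.

```python
from bisect import bisect_left


def find_bin(thresholds, value):
    """Bin index for value: bin k covers (thresholds[k-1], thresholds[k]]."""
    return bisect_left(thresholds, value)


def to_tree_point(label, features, thresholds_per_feature):
    """Replace raw feature values with their bin indices, computed once."""
    bins = [find_bin(thresholds, value)
            for thresholds, value in zip(thresholds_per_feature, features)]
    return (label, bins)
```

Every later pass over the data can then work on the small integer bin indices instead of searching the thresholds again.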

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by feature values, rather than by subsets of values/categories.
* exhibited for unordered features (multi-class classification with categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.
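
To make the distinction concrete, the sketch below enumerates the splits of an unordered categorical feature as subsets of categories. The bitmask encoding and the function names are assumptions for illustration; the count of `2^(k-1) - 1` splits for arity `k` is the usual number of distinct two-way partitions.

```python
def unordered_splits(arity):
    """Each split is the set of categories sent left.

    The last category always goes right, so every two-way partition
    of the categories appears exactly once.
    """
    return [{c for c in range(arity) if (mask >> c) & 1}
            for mask in range(1, 1 << (arity - 1))]


def goes_left(split, category):
    return category in split
```

A split index such as 2 therefore names the subset `{0, 1}` for arity 3, not the category value 2, which is the confusion the fix above removes.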

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:

5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-15 14:50:10 -07:00
DB Tsai 9622106757 [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number
In theory, the scale of your inputs is irrelevant to logistic regression.
You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
adjust accordingly. It will be 1E6 times smaller than the original β1, due
to the invariance property of MLEs.

However, during the optimization process, the convergence (rate)
depends on the condition number of the training dataset. Scaling
the variables often reduces this condition number, thus improving
the convergence rate.

Without reducing the condition number, training on datasets that
mix columns with different scales may fail to converge.

GLMNET and LIBSVM packages perform the scaling to reduce
the condition number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we standardize the training
features by scaling each column to unit variance (without subtracting
the mean, which would densify sparse vectors), and train the model in the
scaled space. Then we transform the coefficients from the scaled space
back to the original scale, as GLMNET and LIBSVM do.
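
The back-transformation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MLlib code: features are divided by the per-column sample standard deviation, and `weights_scaled` is a made-up stand-in for weights learned in the scaled space.

```python
import math


def column_std(rows):
    # Sample standard deviation of each column.
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]


def scale(row, std):
    # Divide only, with no mean subtraction, so zero entries stay zero.
    return [x / s if s != 0.0 else 0.0 for x, s in zip(row, std)]


def to_original_scale(weights_scaled, std):
    # w . x == w_scaled . (x / std)  implies  w = w_scaled / std
    return [w / s if s != 0.0 else 0.0 for w, s in zip(weights_scaled, std)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
std = column_std(rows)
weights_scaled = [0.5, -2.0]  # pretend result of training in the scaled space
weights = to_original_scale(weights_scaled, std)

# The margin is the same in both spaces, so predictions are unchanged.
for row in rows:
    assert abs(dot(weights, row) - dot(weights_scaled, scale(row, std))) < 1e-9
```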

Currently, it's only enabled in LogisticRegressionWithLBFGS.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
2014-08-14 11:56:13 -07:00
Xiangrui Meng 69a57a18ee [SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel
As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets.

Another option for the method name is `setIntermediateRDDStorageLevel`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1913 from mengxr/als-storagelevel and squashes the following commits:

d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel
7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
2014-08-13 23:53:44 -07:00
Xiangrui Meng 7ecb867c4c [MLLIB] use Iterator.fill instead of Array.fill
Iterator.fill uses less memory
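
A loose Python analogy (the commit itself is Scala): a list materializes every value up front, while a generator produces values one at a time. `fill_list` and `fill_iter` are hypothetical helpers that stand in for Array.fill and Iterator.fill.

```python
import random
import sys


def fill_list(n, make_value):
    # Eager: all n values exist in memory at once.
    return [make_value() for _ in range(n)]


def fill_iter(n, make_value):
    # Lazy: values are produced on demand and can be discarded after use.
    return (make_value() for _ in range(n))


rng = random.Random(42)
eager = fill_list(100_000, rng.random)
lazy = fill_iter(100_000, rng.random)

# getsizeof counts only the container itself, not the floats it refers to,
# so the real gap is larger than these numbers suggest.
eager_bytes = sys.getsizeof(eager)
lazy_bytes = sys.getsizeof(lazy)
total = sum(lazy)  # consumes the generator without building a list
```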

Author: Xiangrui Meng <meng@databricks.com>

Closes #1930 from mengxr/rand-gen-iter and squashes the following commits:

24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill
2014-08-13 16:20:49 -07:00
Doris Xin fe4735958e [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics
For both Scala and Python.

The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`.

`MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1911 from dorx/colStats and squashes the following commits:

77b9924 [Doris Xin] developerAPI tag
de9cbbe [Doris Xin] reviewer comments and moved more ser/de
459faba [Doris Xin] colStats in Statistics for both Scala and Python
2014-08-12 23:47:42 -07:00
Liquan Pei f0060b75ff [MLlib] Correctly set vectorSize and alpha
mengxr
Correctly set vectorSize and alpha in Word2Vec training.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits:

85f64f2 [Liquan Pei] correctly set vectorSize and alpha
2014-08-12 00:28:00 -07:00
Xiangrui Meng 9038d94e1e [SPARK-2923][MLLIB] Implement some basic BLAS routines
Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve performance.

Tested on my local machine:

~~~
bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
--regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
~~~

1. before: ~1m
2. after: ~30s
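
For readers unfamiliar with the routines named in the squashed commits (axpy, dot, scal, copy), here is a plain-Python sketch of their level-1 BLAS semantics on dense lists. It shows what the operations compute, not how MLlib implements or dispatches them.

```python
def axpy(a, x, y):
    """y += a * x, in place."""
    assert len(x) == len(y)
    for i, xi in enumerate(x):
        y[i] += a * xi


def dot(x, y):
    """Return the inner product x . y."""
    assert len(x) == len(y)
    return sum(xi * yi for xi, yi in zip(x, y))


def scal(a, x):
    """x *= a, in place."""
    for i in range(len(x)):
        x[i] *= a


def copy(x, y):
    """Overwrite y with the contents of x, in place."""
    assert len(x) == len(y)
    y[:] = x


y = [1.0, 1.0]
axpy(2.0, [1.0, 2.0], y)  # y is now [3.0, 5.0]
scal(0.5, y)              # y is now [1.5, 2.5]
```

Updating `y` in place avoids allocating a temporary vector for every gradient step, which is one plausible source of the speedup reported above.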

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #1849 from mengxr/ml-blas and squashes the following commits:

ba583a2 [Xiangrui Meng] exclude Vector.copy
a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
6edeab9 [Xiangrui Meng] address comments
940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
c2a38bc [Xiangrui Meng] enhance dot tests
4cfaac4 [Xiangrui Meng] add apache header
48d01d2 [Xiangrui Meng] add tests for zeros and copy
3b882b1 [Xiangrui Meng] use blas.scal in gradient
735eb23 [Xiangrui Meng] remove d from BLAS routines
d2d7d3c [Xiangrui Meng] update gradient and lbfgs
7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
14e6645 [Xiangrui Meng] add ddot
cbb8273 [Xiangrui Meng] add daxpy test
07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
e8c326d [Xiangrui Meng] axpy
2014-08-11 22:33:45 -07:00
DB Tsai 6fab941b65 [SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface
for training with LBFGS Optimizer which will converge faster than SGD.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1862 from dbtsai/dbtsai-lbfgs-lor and squashes the following commits:

aa84b81 [DB Tsai] small change
f852bcd [DB Tsai] Remove duplicate method
f119fdc [DB Tsai] Formatting
97776aa [DB Tsai] address more feedback
85b4a91 [DB Tsai] address feedback
3cf50c2 [DB Tsai] LogisticRegressionWithLBFGS interface
2014-08-11 19:49:29 -07:00
Doris Xin 32638b5e74 [SPARK-2515][mllib] Chi Squared test
Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1733 from dorx/chisquare and squashes the following commits:

cafb3a7 [Doris Xin] fixed p-value for extreme case.
d286783 [Doris Xin] Merge branch 'master' into chisquare
e95e485 [Doris Xin] reviewer comments.
7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class
80d03e2 [Doris Xin] Reviewer comments.
c39eeb5 [Doris Xin] units passed with updated API
e90d90a [Doris Xin] Merge branch 'master' into chisquare
7eea80b [Doris Xin] WIP
d64c2fb [Doris Xin] Merge branch 'master' into chisquare
5686082 [Doris Xin] facelift
bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring
50703a5 [Doris Xin] merge master
4e4e361 [Doris Xin] WIP
e6b83f3 [Doris Xin] reviewer comments
3d61582 [Doris Xin] input names
706d436 [Doris Xin] Added API for RDD[Vector]
6598379 [Doris Xin] API and code structure.
ff17423 [Doris Xin] WIP
2014-08-11 19:22:14 -07:00
Xiangrui Meng 74d6f62264 [SPARK-1997][MLLIB] update breeze to 0.9
0.9 dependencies (this version doesn't depend on scalalogging and I excluded commons-math3 from its transitive dependencies):
~~~
+-org.scalanlp:breeze_2.10:0.9 [S]
  +-com.github.fommil.netlib:core:1.1.2
  +-com.github.rwl:jtransforms:2.4.0
  +-net.sf.opencsv:opencsv:2.3
  +-net.sourceforge.f2j:arpack_combined_all:0.1
  +-org.scalanlp:breeze-macros_2.10:0.3.1 [S]
  | +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
  |
  +-org.slf4j:slf4j-api:1.7.5
  +-org.spire-math:spire_2.10:0.7.4 [S]
    +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
    |
    +-org.spire-math:spire-macros_2.10:0.7.4 [S]
      +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
~~~

Closes #1749

CC: witgo avati

Author: Xiangrui Meng <meng@databricks.com>

Closes #1857 from mengxr/breeze-0.9 and squashes the following commits:

7fc16b6 [Xiangrui Meng] don't know why but exclude a private method for mima
dcc502e [Xiangrui Meng] update breeze to 0.9
2014-08-08 15:07:31 -07:00
Xiangrui Meng b9e9e53773 [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms
This is part of SPARK-2828:

1. separate IDF model from IDF algorithm (which generates a model)
2. separate StandardScaler model from StandardScaler
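
The algorithm/model split can be sketched in Python as follows. This is an illustration, not the MLlib API: the class layout is simplified, and the smoothed formula `log((m + 1) / (df + 1))` is an assumption about the IDF weighting.

```python
import math


class IDFModel:
    """Holds the fitted IDF weights and applies them to term-frequency vectors."""

    def __init__(self, idf):
        self.idf = idf

    def transform(self, tf):
        return [t * w for t, w in zip(tf, self.idf)]


class IDF:
    """The algorithm: it carries no fitted state and generates a model from data."""

    def fit(self, documents):
        m = len(documents)
        dims = len(documents[0])
        doc_freq = [sum(1 for d in documents if d[j] > 0) for j in range(dims)]
        return IDFModel([math.log((m + 1.0) / (df + 1.0)) for df in doc_freq])


docs = [[1.0, 0.0], [1.0, 2.0]]
model = IDF().fit(docs)                # fitting returns a separate model object
weighted = model.transform([1.0, 1.0])  # the model can be reused on new vectors
```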

CC: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1814 from mengxr/feature-api-update and squashes the following commits:

40d863b [Xiangrui Meng] move mean and variance to model
48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm
89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm
2014-08-07 11:28:12 -07:00
Joseph K. Bradley 8d1dec4fa4 [mllib] DecisionTree Strategy parameter checks
Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters.
CC mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1821 from jkbradley/dt-robustness and squashes the following commits:

4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness
7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters
2014-08-07 00:20:38 -07:00
Joseph K. Bradley 47ccd5e71b [SPARK-2851] [mllib] DecisionTree Python consistency update
Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).

Added factory classes for Algo and Impurity, but made private[mllib].

CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:

6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
fe6dbfa [Joseph K. Bradley] removed unnecessary imports
e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
c699850 [Joseph K. Bradley] a few doc comments
eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
2014-08-06 22:58:59 -07:00
Xiangrui Meng 25cff1019d [SPARK-2852][MLLIB] API consistency for mllib.feature
This is part of SPARK-2828:

1. added a Java-friendly fit method to Word2Vec with tests
2. change DeveloperApi to Experimental for Normalizer & StandardScaler
3. change default feature dimension to 2^20 in HashingTF
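
As background for the HashingTF default, a minimal Python sketch of the hashing trick with 2^20 buckets. The use of `zlib.crc32` is an assumption made so the example is deterministic; it is not the hash function HashingTF uses.

```python
import zlib


def hashing_tf(terms, num_features=1 << 20):
    """Map terms to a sparse {index: count} vector via the hashing trick."""
    counts = {}
    for term in terms:
        # Stand-in hash; distinct terms may collide in the same bucket.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


tf = hashing_tf(["spark", "mllib", "spark"])
```

A larger feature dimension lowers the chance that two distinct terms collide in one bucket, at the cost of a wider (but still sparse) vector.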

Author: Xiangrui Meng <meng@databricks.com>

Closes #1807 from mengxr/feature-api-check and squashes the following commits:

773c1a9 [Xiangrui Meng] change default numFeatures to 2^20 in HashingTF change annotation from DeveloperApi to Experimental in Normalizer and StandardScaler
883e122 [Xiangrui Meng] add @Experimental to Word2VecModel add a Java-friendly method to Word2Vec.fit with tests
2014-08-06 14:07:51 -07:00
DB Tsai c7b52010df [MLlib] Use this.type as return type in k-means' builder pattern
to ensure that the return object is itself.
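
A Python analogy for the `this.type` change (the commit is Scala): setters are annotated to return the receiver's own type, so a chain started on a subclass keeps the subclass type. All class and method names below are hypothetical.

```python
from typing import TypeVar

T = TypeVar("T", bound="KMeansParams")


class KMeansParams:
    def __init__(self):
        self.k = 2
        self.max_iterations = 20

    def set_k(self: T, k: int) -> T:
        self.k = k
        return self  # same object, typed as the caller's own class

    def set_max_iterations(self: T, max_iterations: int) -> T:
        self.max_iterations = max_iterations
        return self


class ExtendedKMeansParams(KMeansParams):
    def __init__(self):
        super().__init__()
        self.decay = 1.0

    def set_decay(self: T, decay: float) -> T:
        self.decay = decay
        return self


# set_k is inherited, yet the chain can continue with a subclass-only setter.
params = ExtendedKMeansParams().set_k(5).set_decay(0.9)
```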

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1796 from dbtsai/dbtsai-kmeans and squashes the following commits:

658989e [DB Tsai] Alpine Data Labs
2014-08-05 23:32:29 -07:00
Michael Giannakopoulos 1aad9114c9 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods
Related to Jira Issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC)

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1775 from miccagiann/linearMethodsReg and squashes the following commits:

cb774c3 [Michael Giannakopoulos] MiniBatchFraction added in related PythonMLLibAPI java stubs.
81fcbc6 [Michael Giannakopoulos] Fixing a typo-error.
8ad263e [Michael Giannakopoulos] Adding regularizer type and intercept parameters to LogisticRegressionWithSGD and SVMWithSGD.
2014-08-05 16:30:32 -07:00
Xiangrui Meng cc491f69cd [SPARK-2864][MLLIB] fix random seed in word2vec; move model to local
It also moves the model to local in order to map `RDD[String]` to `RDD[Vector]`.

Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #1790 from mengxr/word2vec-fix and squashes the following commits:

a87146c [Xiangrui Meng] add setters and make a default constructor
e5c923b [Xiangrui Meng] fix random seed in word2vec; move model to local
2014-08-05 16:22:41 -07:00
Liquan Pei e053c55819 [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

One way to investigate the vector representations is to find the closest words to a query word. For example, the top 20 closest words to "china" with 1 partition and 1 iteration are:

taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876

The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932
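
The rankings above can be reproduced in spirit with a nearest-neighbour search over word vectors. The Python sketch below assumes the scores are cosine similarities; the two-dimensional vectors are made up for illustration and are unrelated to the trained model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(word, vectors, num):
    """Return the num words most similar to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec)) for other, vec in vectors.items() if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:num]


# Toy vectors, invented for this example only.
toy_vectors = {
    "china": [1.0, 0.0],
    "taiwan": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
neighbours = closest_words("china", toy_vectors, 2)
```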

Author: Liquan Pei <lpei@gopivotal.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Liquan Pei <liquanpei@gmail.com>

Closes #1719 from Ishiihara/master and squashes the following commits:

2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit
2014-08-03 23:55:58 -07:00
DB Tsai ae58aea2d1 SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.

There are two implementations of `VectorTransformer` now, and both can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

2) `Normalizer` - Normalizes samples individually to unit L^n norm
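
A rough Python sketch of the shape described above: one abstract `transform` method with two implementations. The constructor arguments (precomputed `mean` and `std`, norm order `p`) are assumptions for this example and do not mirror the Scala signatures.

```python
from abc import ABC, abstractmethod


class VectorTransformer(ABC):
    @abstractmethod
    def transform(self, vector):
        """Return a transformed copy of `vector`."""


class StandardScaler(VectorTransformer):
    def __init__(self, mean, std):
        # Column statistics, assumed to be computed on the training set beforehand.
        self.mean = mean
        self.std = std

    def transform(self, vector):
        return [
            (x - m) / s if s != 0.0 else 0.0
            for x, m, s in zip(vector, self.mean, self.std)
        ]


class Normalizer(VectorTransformer):
    def __init__(self, p=2.0):
        self.p = p

    def transform(self, vector):
        norm = sum(abs(x) ** self.p for x in vector) ** (1.0 / self.p)
        return [x / norm for x in vector] if norm != 0.0 else list(vector)


unit = Normalizer(2.0).transform([3.0, 4.0])
standardized = StandardScaler(mean=[1.0], std=[2.0]).transform([5.0])
```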

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs
2014-08-03 21:39:21 -07:00
Joseph K. Bradley 2998e38a94 [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.

Other improvements to Decision Trees for ease-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
--> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1740 from jkbradley/dt-java-new and squashes the following commits:

0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-03 10:36:52 -07:00
Joseph K. Bradley 3f67382e7c [SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.

API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()

Examples and testing:
* Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors

Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
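
The difference between the two checks is easy to demonstrate. The classes below are local stand-ins, not the pyspark classes, and the helper names are hypothetical.

```python
class RDD:
    """Stand-in for an RDD base class."""


class DerivedRDD(RDD):
    """Stand-in for any RDD subclass."""


def is_rdd_exact(obj):
    # Old-style check: only true for the exact class, so subclasses are rejected.
    return type(obj) == RDD


def is_rdd(obj):
    # isinstance() also accepts instances of subclasses.
    return isinstance(obj, RDD)


derived = DerivedRDD()
print(is_rdd_exact(derived), is_rdd(derived))
```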

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:

3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
2014-08-02 13:07:17 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
2014-08-01 23:55:11 -07:00
Burak fda475987f [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator. RandomRDD is now of generic type
The RandomRDDGenerators used to only output RDD[Double].
Now RandomRDDGenerators.randomRDD can generate a random RDD[T] from any class that extends RandomDataGenerator: the user supplies a type T and overrides the nextValue() function as needed.
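
A Python sketch of the generic-generator idea, with a plain list standing in for an RDD. `CoinFlipGenerator`, `random_data` and the seeding scheme are hypothetical and only illustrate overriding a single value-producing method.

```python
import random
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

T = TypeVar("T")


class RandomDataGenerator(ABC, Generic[T]):
    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    @abstractmethod
    def next_value(self) -> T:
        """Produce one random value of type T."""


class CoinFlipGenerator(RandomDataGenerator[str]):
    # Values no longer have to be doubles: this generator yields strings.
    def next_value(self) -> str:
        return self._rng.choice(["heads", "tails"])


def random_data(generator: RandomDataGenerator[T], size: int) -> List[T]:
    return [generator.next_value() for _ in range(size)]


flips = random_data(CoinFlipGenerator(seed=1), 5)
```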

Author: Burak <brkyvz@gmail.com>

Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:

c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
2014-08-01 22:32:12 -07:00
Tor Myklebust e25ec06171 [SPARK-1580][MLLIB] Estimate ALS communication and computation costs.
Continue the work from #493.

Closes #493 and Closes #593

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:

9b56a8b [Xiangrui Meng] updated API and added a simple test
68a3229 [Xiangrui Meng] merge master
217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
8cbb718 [Tor Myklebust] Braces get spaces.
0455cd4 [Tor Myklebust] Parens for collectAsMap.
2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
8b21e6d [Tor Myklebust] Fix overlong lines.
8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
6c31324 [Tor Myklebust] Make it actually build...
a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-08-01 21:25:02 -07:00
Michael Giannakopoulos c281189222 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods.
Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
2014-08-01 21:00:31 -07:00
Jeremy Freeman f6a1899306 Streaming mllib [SPARK-2438][MLLIB]
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_
- An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_
- Unit tests

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming
2014-08-01 20:10:26 -07:00
Joseph K. Bradley 7058a5393b [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features
Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.

Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"

Bug fix: Modified upper bound discussed above.
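
The two upper bounds can be compared with a small worked example. This is an arithmetic illustration only, not the DecisionTree code. The function names are hypothetical, and whether each bound is inclusive follows the implementation, which is not reproduced here.

```python
def unordered_upper_bound(arity: int) -> int:
    """Bound quoted in the bug report: 2^(arity - 1) - 1."""
    return 2 ** (arity - 1) - 1


def ordered_upper_bound(arity: int) -> int:
    """Bound after the fix: the arity of the feature."""
    return arity


# Compare the two bounds for a few arities.
for arity in (2, 3, 5):
    print(arity, unordered_upper_bound(arity), ordered_upper_bound(arity))
```

For arity 2 the old bound is smaller than the arity, and for arity 5 it is three times larger; the two agree only at arity 3.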

Also: Small improvements to coding style in DecisionTree.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:

225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-01 15:52:21 -07:00
Doris Xin d88e695613 [SPARK-2786][mllib] Python correlations
Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1713 from dorx/pythonCorrelation and squashes the following commits:

5f1e60c [Doris Xin] reviewer comments.
46ff6eb [Doris Xin] reviewer comments.
ad44085 [Doris Xin] style fix
e69d446 [Doris Xin] fixed missed conflicts.
eb5bf56 [Doris Xin] merge master
cc9f725 [Doris Xin] units passed.
9141a63 [Doris Xin] WIP2
d199f1f [Doris Xin] Moved correlation names into a public object
cd163d6 [Doris Xin] WIP
2014-08-01 15:02:17 -07:00
Xiangrui Meng 5328c0aaa0 [HOTFIX] downgrade breeze version to 0.7
breeze-0.8.1 causes dependency issues, as discussed in #940 .

Author: Xiangrui Meng <meng@databricks.com>

Closes #1718 from mengxr/revert-breeze and squashes the following commits:

99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
2014-08-01 10:00:46 -07:00
witgo 0dacb1adb5 [SPARK-1997] update breeze to version 0.8.1
`breeze 0.8.1` depends on `scala-logging-slf4j 2.1.1`. The relevant code is in #1369.

Author: witgo <witgo@qq.com>

Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:

65cc65e [witgo] update breeze  to version 0.8.1
2014-08-01 07:47:44 -07:00
Sean Owen 82d209d43f SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel
Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa – users for a product).
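
A rough sketch of what a recommend method computes, written in plain Python rather than against the MLlib API. It assumes the score of a `(user, product)` pair is the dot product of their latent factor vectors. The function name `recommend_products` and the example factors are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def recommend_products(user_factor: Sequence[float],
                       product_factors: Dict[int, Sequence[float]],
                       n: int) -> List[Tuple[int, float]]:
    """Return the n highest-scoring (product_id, score) pairs for one user."""
    scored = (
        (product_id, sum(u * p for u, p in zip(user_factor, factor)))
        for product_id, factor in product_factors.items()
    )
    # nlargest keeps only n candidates in memory while scanning all products.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


user = [1.0, 0.5]
products = {10: [1.0, 0.0], 20: [0.0, 1.0], 30: [1.0, 1.0]}
top_two = recommend_products(user, products, 2)
```

Recommending users for a product is the same computation with the roles of the two factor tables swapped.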

(This also corrects some long lines in the Java ALS test suite.)

As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? With a pointer about where that should go, I could add one.

Author: Sean Owen <srowen@gmail.com>

Closes #1687 from srowen/SPARK-2768 and squashes the following commits:

b349675 [Sean Owen] Additional review changes
c9edb04 [Sean Owen] Updates from code review
7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
2014-08-01 07:32:53 -07:00
Doris Xin c4755403e7 [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation
Before this patch, getRanks computed the wrong rank when numPartition >= size in the input RDDs. Added unit tests to cover this bug.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1710 from dorx/correlationBug and squashes the following commits:

733def4 [Doris Xin] bugs and reviewer comments.
31db920 [Doris Xin] revert unnecessary change
043ff83 [Doris Xin] bug fix for spearman corner case
2014-07-31 21:23:35 -07:00
Xiangrui Meng b19008320b [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK
Currently the factors are persisted in memory only. If they get evicted by later jobs, we might have to restart the computation from the very beginning. A better solution is to change the storage level to `MEMORY_AND_DISK`.

srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #1700 from mengxr/als-level and squashes the following commits:

c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
2014-07-31 21:14:08 -07:00
Joseph K. Bradley b124de584a [SPARK-2756] [mllib] Decision tree bug fixes
(1) Inconsistent aggregate (agg) indexing for unordered features.
(2) Fixed gain calculations for edge cases.
(3) Off-by-one error in choosing thresholds for continuous features for small datasets.
(4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)

Other updates, to help with tests:
* Updated DecisionTreeRunner to print more info.
* Added utility functions to DecisionTreeModel: toString, depth, numNodes
* Improved internal DecisionTree documentation

Bug fix details:

(1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

(2) Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.

Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.

(3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.

* Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
* Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
* Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using exam
* Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
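
The threshold fix in (3) can be sketched as follows. This plain-Python illustration takes the average of each pair of consecutive sorted feature values, as described above. The function name `candidate_thresholds` is hypothetical, and details of the real `findSplitsBins` (sampling, number of bins) are not modeled.

```python
from typing import List, Sequence


def candidate_thresholds(feature_values: Sequence[float]) -> List[float]:
    """Midpoints between consecutive sorted feature values."""
    ordered = sorted(feature_values)
    return [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]


thresholds = candidate_thresholds([3.0, 1.0, 2.0])
```

With midpoints, a threshold never coincides with a training example's own value unless two consecutive examples share that value, so each example falls strictly on one side of the split.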

CC: mengxr manishamde  Please let me know if I missed something!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:

2b20c61 [Joseph K. Bradley] Small doc and style updates
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
2014-07-31 20:51:48 -07:00
Doris Xin d8430148ee [SPARK-2724] Python version of RandomRDDGenerators
Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take an arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP
2014-07-31 20:32:57 -07:00
Xiangrui Meng dc0865bc7e [SPARK-2511][MLLIB] add HashingTF and IDF
This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .

Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
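
As an illustration of the transformer pattern, here is a minimal TF-IDF sketch in plain Python. It is not the MLlib implementation. Using `zlib.crc32` as the hash function and the smoothed formula `log((m + 1) / (df + 1))` for IDF are assumptions made for this example, and the function names are hypothetical.

```python
import math
import zlib
from typing import Dict, List, Sequence

SparseVector = Dict[int, float]  # feature index -> value


def hashing_tf(tokens: Sequence[str], num_features: int) -> SparseVector:
    """Map each token to a bucket by hashing and count occurrences per bucket."""
    tf: SparseVector = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        tf[index] = tf.get(index, 0.0) + 1.0
    return tf


def fit_idf(tf_vectors: List[SparseVector]) -> SparseVector:
    """Compute a smoothed inverse document frequency per bucket (assumed formula)."""
    m = len(tf_vectors)
    doc_freq: Dict[int, int] = {}
    for vector in tf_vectors:
        for index in vector:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {i: math.log((m + 1.0) / (df + 1.0)) for i, df in doc_freq.items()}


def transform(tf: SparseVector, idf: SparseVector) -> SparseVector:
    """Scale term frequencies by the fitted IDF weights."""
    return {i: count * idf.get(i, 0.0) for i, count in tf.items()}


num_features = 1 << 20
docs = [["spark", "mllib"], ["spark", "streaming"]]
tf_vectors = [hashing_tf(doc, num_features) for doc in docs]
idf = fit_idf(tf_vectors)
tfidf_vectors = [transform(tf, idf) for tf in tf_vectors]
```

Hashing avoids building a vocabulary, at the cost of possible collisions between tokens that land in the same bucket.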

Author: Xiangrui Meng <meng@databricks.com>

Closes #1671 from mengxr/tfidf and squashes the following commits:

7d65888 [Xiangrui Meng] use JavaConverters._
5fe9ec4 [Xiangrui Meng] fix unit test
6e214ec [Xiangrui Meng] add apache header
cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
3814440 [Xiangrui Meng] add HashingTF and IDF
2014-07-31 12:55:00 -07:00
Sean Owen e9b275b769 SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets
Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.

Author: Sean Owen <srowen@gmail.com>

Closes #1663 from srowen/SPARK-2341 and squashes the following commits:

8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
18a8c8e [Sean Owen] Updates from review
83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
2014-07-30 17:34:32 -07:00
Sean Owen 6ab96a6fd0 SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
The Maven-based builds in the build matrix have been failing for a few days:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/

On inspection, it looks like the Spark SQL Java tests don't compile:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull

I confirmed it by repeating the command vs master:

`mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`

The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`

Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).

Author: Sean Owen <srowen@gmail.com>

Closes #1660 from srowen/SPARK-2749 and squashes the following commits:

858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
2014-07-30 15:04:33 -07:00
GuoQiang Li fc47bb6967 [SPARK-2544][MLLIB] Improve ALS algorithm resource usage
Author: GuoQiang Li <witgo@qq.com>
Author: witgo <witgo@qq.com>

Closes #929 from witgo/improve_als and squashes the following commits:

ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
154dccf [GuoQiang Li] checkpoint products only
c5779ff [witgo] Improve ALS algorithm resource usage
2014-07-30 11:00:11 -07:00
Sean Owen ee07541e99 SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log
In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists.

Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.

While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible.
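
The precision loss is easy to reproduce. The sketch below uses Python's `math.log1p` and `math.expm1`, which play the same role as the Java `Math.log1p` and `Math.expm1` methods named above; the value of `p` is an arbitrary small number chosen for illustration.

```python
import math

p = 1e-20  # so small that 1.0 + p rounds to exactly 1.0 in double precision

naive_log = math.log(1.0 + p)    # loses p entirely before taking the log
precise_log = math.log1p(p)      # computes log(1 + p) without forming 1 + p

naive_exp = math.exp(p) - 1.0    # exp(p) rounds to 1.0, so the difference is 0
precise_exp = math.expm1(p)      # computes exp(p) - 1 directly

print(naive_log, precise_log)
print(naive_exp, precise_exp)
```

The naive forms return zero, while the dedicated functions return a value very close to `p`.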

Also note the related PR for Python: https://github.com/apache/spark/pull/1652

Author: Sean Owen <srowen@gmail.com>

Closes #1659 from srowen/SPARK-2748 and squashes the following commits:

c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
2014-07-30 08:55:15 -07:00
Xiangrui Meng 20424dad30 [SPARK-2174][MLLIB] treeReduce and treeAggregate
In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I tried several implementations, including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some communication overhead, because the driver still needs to coordinate data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.
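
The n -> sqrt(n) -> 1 pattern can be sketched with a sequential simulation in plain Python. This is not the Spark implementation: there is no cluster or shuffle here, and the function name `tree_reduce` and the chunking scheme are assumptions made for illustration.

```python
import math
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_reduce(partitions: List[T], op: Callable[[T, T], T], depth: int = 2) -> T:
    """Combine per-partition values level by level instead of all at once."""
    if not partitions:
        raise ValueError("cannot reduce an empty collection")
    level = list(partitions)
    # With depth 2 and n partitions, each combiner handles about sqrt(n) inputs.
    scale = max(int(math.ceil(len(level) ** (1.0 / depth))), 2)
    while len(level) > scale:
        # Contiguous chunks keep the original order, so op only needs to be associative.
        chunks = [level[i:i + scale] for i in range(0, len(level), scale)]
        level = [reduce(op, chunk) for chunk in chunks]
    # The final step stands in for the driver, which now sees at most `scale` values.
    return reduce(op, level)


total = tree_reduce(list(range(1024)), lambda a, b: a + b, depth=2)
```

With 1024 partitions and depth 2, the intermediate level holds 32 partial results, so the final step combines 32 values instead of 1024.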

I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on 32-node m3.2xlarge cluster.

code:

~~~
import breeze.linalg._
import org.apache.log4j._

Logger.getRootLogger.setLevel(Level.OFF)

for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
  val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
  var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
  start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
}
~~~

out:

| n | treeReduce(,2) | reduce |
|---|---------------------|-----------|
| 10 | 0.215538731 | 0.204206899 |
| 100 | 0.278405907 | 0.205732582 |
| 1000 | 0.208972182 | 0.214298272 |
| 10000 | 0.194792071 | 0.349353687 |
| 100000 | 0.347683285 | 6.086671892 |
| 1000000 | 2.589350682 | 66.572906702 |

CC: @pwendell

This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1110 from mengxr/tree and squashes the following commits:

c6cd267 [Xiangrui Meng] make depth default to 2
b04b96a [Xiangrui Meng] address comments
9bcc5d3 [Xiangrui Meng] add depth for readability
7495681 [Xiangrui Meng] fix compile error
142a857 [Xiangrui Meng] merge master
d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
be6a88a [Xiangrui Meng] use treeAggregate in mllib
0f94490 [Xiangrui Meng] add docs
eb71c33 [Xiangrui Meng] add treeReduce
fe42a5e [Xiangrui Meng] add treeAggregate
2014-07-29 01:16:41 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request, even one that doesn't touch SQL code, also executed the test suites defined in `hive-thriftserver`, and those tests failed because the related .class files were not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
DB Tsai 255b56f9f5 [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests
Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.

Based on discussion in the community, we have implemented two different APIs for relative tolerance, and absolute tolerance. It makes sense that test writers should know which one they need depending on their circumstances.

Developers also need to specify the eps explicitly; there is no default value, because a default would sometimes cause confusion.

When comparing against zero using relative tolerance, an exception will be raised to warn users that it's meaningless.

For relative tolerance, users can now write

    assert(23.1 ~== 23.52 relTol 0.02)
    assert(23.1 ~== 22.74 relTol 0.02)
    assert(23.1 ~= 23.52 relTol 0.02)
    assert(23.1 ~= 22.74 relTol 0.02)
    assert(!(23.1 !~= 23.52 relTol 0.02))
    assert(!(23.1 !~= 22.74 relTol 0.02))

    // This will throw exception with the following message.
    // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
    assert(23.1 !~== 23.52 relTol 0.02)

    // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
    assert(23.1 ~== 22.34 relTol 0.02)

For absolute error,

    assert(17.8 ~== 17.99 absTol 0.2)
    assert(17.8 ~== 17.61 absTol 0.2)
    assert(17.8 ~= 17.99 absTol 0.2)
    assert(17.8 ~= 17.61 absTol 0.2)
    assert(!(17.8 !~= 17.99 absTol 0.2))
    assert(!(17.8 !~= 17.61 absTol 0.2))

    // This will throw exception with the following message.
    // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
    assert(17.8 !~== 17.99 absTol 0.2)

    // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
    assert(17.8 ~== 17.59 absTol 0.2)
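
The difference between the two tolerances can be sketched in plain Python. This is not the Scala test utility itself. Scaling the relative tolerance by the smaller of the two magnitudes is an assumption; it is consistent with the examples above, but the exact definition in the PR is not quoted here. The function names are hypothetical.

```python
def within_relative(x: float, y: float, rel_tol: float) -> bool:
    """True if |x - y| is small relative to the smaller magnitude (assumed definition)."""
    abs_x, abs_y = abs(x), abs(y)
    if abs_x == 0.0 or abs_y == 0.0:
        # Any nonzero difference is infinitely large relative to zero.
        raise ValueError("relative tolerance is meaningless when comparing against zero")
    return abs(x - y) < rel_tol * min(abs_x, abs_y)


def within_absolute(x: float, y: float, abs_tol: float) -> bool:
    """True if |x - y| is below a fixed bound, regardless of magnitude."""
    return abs(x - y) < abs_tol


# The same pairs as in the examples above.
print(within_relative(23.1, 23.52, 0.02), within_relative(23.1, 22.34, 0.02))
print(within_absolute(17.8, 17.99, 0.2), within_absolute(17.8, 17.59, 0.2))
```

Relative tolerance suits values whose scale varies widely; absolute tolerance suits values expected to be near zero.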

Authors:
  DB Tsai <dbtsai@alpinenow.com>
  Marek Kolodziej <marek@alpinenow.com>

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:

8c7cbcc [DB Tsai] Alpine Data Labs
2014-07-28 11:34:19 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Doris Xin 81fcdd22c8 [SPARK-2514] [mllib] Random RDD generator
Utilities for generating random RDDs.

RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range` objects in Scala can only have `size <= Int.MaxValue`.

The object `RandomRDDGenerators` can be transformed into a generator class to reduce the number of auxiliary methods for optional arguments.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1520 from dorx/randomRDD and squashes the following commits:

01121ac [Doris Xin] reviewer comments
6bf27d8 [Doris Xin] Merge branch 'master' into randomRDD
a8ea92d [Doris Xin] Reviewer comments
063ea0b [Doris Xin] Merge branch 'master' into randomRDD
aec68eb [Doris Xin] newline
bc90234 [Doris Xin] units passed.
d56cacb [Doris Xin] impl with RandomRDD
92d6f1c [Doris Xin] solution for Cloneable
df5bcff [Doris Xin] Merge branch 'generator' into randomRDD
f46d928 [Doris Xin] WIP
49ed20d [Doris Xin] alternative poisson distribution generator
7cb0e40 [Doris Xin] fix for data inconsistency
8881444 [Doris Xin] RandomRDDGenerator: initial design
2014-07-27 16:16:39 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Doris Xin 3a69c72e5c [SPARK-2679] [MLLib] Ser/De for Double
Added a serializer/deserializer pair for Double in _common.py and PythonMLLibAPI in MLlib.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1581 from dorx/doubleSerDe and squashes the following commits:

86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
2bfe7a4 [Doris Xin] Removed magic byte
ad4d0d9 [Doris Xin] removed a space in unit
a9020bc [Doris Xin] units passed
7dad9af [Doris Xin] WIP
2014-07-27 07:21:07 -07:00
Xiangrui Meng aaf2b735fd [SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure
We saw task serialization problems with large feature dimensions, which could be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
2014-07-26 22:56:07 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail. We should investigate and put this back once it passes tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug, since it involves more subtle considerations and is worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Matei Zaharia 8529ced35c SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup
JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
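
The structure of such a buffer can be sketched in plain Python. This shows the layout only: Python objects do not reproduce the JVM memory figures above, and the field names and methods here are hypothetical rather than taken from the patch.

```python
from typing import Any, Iterator, List, Optional

# Per-group ArrayBuffer cost quoted above: array header + 16 pointers + buffer object.
ARRAY_BUFFER_OVERHEAD_BYTES = 24 + 16 * 8 + 32


class CompactBuffer:
    """Append-only buffer that keeps its first two elements in plain fields."""

    __slots__ = ("_element0", "_element1", "_rest", "_size")

    def __init__(self) -> None:
        self._element0: Any = None
        self._element1: Any = None
        self._rest: Optional[List[Any]] = None  # allocated only for a third element
        self._size = 0

    def append(self, value: Any) -> None:
        if self._size == 0:
            self._element0 = value
        elif self._size == 1:
            self._element1 = value
        else:
            if self._rest is None:
                self._rest = []
            self._rest.append(value)
        self._size += 1

    def __len__(self) -> int:
        return self._size

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < self._size:
            raise IndexError(index)
        if index == 0:
            return self._element0
        if index == 1:
            return self._element1
        return self._rest[index - 2]

    def __iter__(self) -> Iterator[Any]:
        for index in range(self._size):
            yield self[index]


small = CompactBuffer()
small.append("a")
small.append("b")  # still no backing list allocated
```

Groups with one or two values, which are common in joins, never allocate the backing list at all.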

There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.

Author: Matei Zaharia <matei@databricks.com>

Closes #1555 from mateiz/compact-groupby and squashes the following commits:

845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
07621a7 [Matei Zaharia] Review comments
0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
f61f040 [Matei Zaharia] Fix line lengths
59da88b0 [Matei Zaharia] Fix line lengths
197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
2014-07-25 00:32:32 -07:00
Xiangrui Meng c960b50518 [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests
Allow small errors in comparison.

@dbtsai, this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests
2014-07-24 12:37:02 -07:00
Xiangrui Meng 4c7243e109 [SPARK-2617] Correct doc and usages of preservesPartitioning
The name `preservesPartitioning` is ambiguous: it could mean 1) preserving the indices of partitions, or 2) preserving the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `mapPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.
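
The distinction between keeping partition indices and keeping the partitioner can be shown with a toy model. This is a plain-Python sketch with no Spark dependency; `partition_of`, `partition_by`, `map_partitions`, and `respects_partitioner` are hypothetical helpers, and the modulo hash partitioner is an assumption made for illustration.

```python
def partition_of(key, num_partitions):
    """Toy hash partitioner: the partition index a key belongs to."""
    return hash(key) % num_partitions


def partition_by(pairs, num_partitions):
    """Place (key, value) pairs into partitions according to the partitioner."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def map_partitions(partitions, func):
    """Apply func to every record, leaving each record in its partition."""
    return [[func(record) for record in part] for part in partitions]


def respects_partitioner(partitions):
    """True if every key still sits where the partitioner would place it."""
    n = len(partitions)
    return all(
        partition_of(key, n) == index
        for index, part in enumerate(partitions)
        for key, _ in part
    )


parts = partition_by([(k, k * 10) for k in range(8)], 4)

# Only the values change: keys stay where the partitioner expects them,
# so declaring preservesPartitioning = true would be safe.
values_changed = map_partitions(parts, lambda kv: (kv[0], kv[1] + 1))

# The keys change: partition indices are unchanged, but the partitioner no
# longer describes the data, so preservesPartitioning must be false.
keys_changed = map_partitions(parts, lambda kv: (kv[0] + 1, kv[1]))
```

In both cases the partition indices are untouched; only the first transformation keeps the partitioner valid, which is the second reading of the flag.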

Author: Xiangrui Meng <meng@databricks.com>

Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
2014-07-23 00:58:55 -07:00
peng.zhang 75db1742ab [SPARK-2612] [mllib] Fix data skew in ALS
Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1521 from renozhang/fix-als and squashes the following commits:

b5727a4 [peng.zhang] Remove no need argument
1a4f7a0 [peng.zhang] Fix data skew in ALS
2014-07-22 02:39:07 -07:00
Xiangrui Meng 1b10b8114a [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors
This is part of SPARK-2495 to allow users to construct linear models manually.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1492 from mengxr/public-constructor and squashes the following commits:

a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
2014-07-20 13:04:59 -07:00
Doris Xin a243364b22 [SPARK-2359][MLlib] Correlations
Implementation for Pearson and Spearman's correlation.
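
For reference, the two statistics can be sketched on plain Python lists. This is a single-machine illustration, not the distributed implementation in this PR; the function names are hypothetical, and computing Spearman as Pearson on average ranks (ties share the mean of their positions) is the standard definition assumed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the variance is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A monotone but nonlinear relationship such as `y = x * x` on positive inputs gives a Spearman correlation of 1 while the Pearson correlation stays below 1.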

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1367 from dorx/correlation and squashes the following commits:

c0dd7dc [Doris Xin] here we go
32d83a3 [Doris Xin] Reviewer comments
4db0da1 [Doris Xin] added private[stat] to Spearman
b716f70 [Doris Xin] minor fixes
6e1b42a [Doris Xin] More comments addressed. Still some open questions
8104f44 [Doris Xin] addressed comments. some open questions still
39387c2 [Doris Xin] added missing header
bd3cf19 [Doris Xin] Merge branch 'master' into correlation
6341884 [Doris Xin] race condition bug squished
bd2bacf [Doris Xin] Race condition bug
b775ff9 [Doris Xin] old wrong impl
534ebf2 [Doris Xin] Merge branch 'master' into correlation
818fa31 [Doris Xin] wip units
9d808ee [Doris Xin] wip units
b843a13 [Doris Xin] revert change in stat counter
28561b6 [Doris Xin] wip
bb2e977 [Doris Xin] minor fix
8e02c63 [Doris Xin] Merge branch 'master' into correlation
2a40aa1 [Doris Xin] initial, untested implementation of Pearson
dfc4854 [Doris Xin] WIP
2014-07-18 17:25:32 -07:00
Manish Amde d88f6be446 [MLlib] SPARK-1536: multiclass classification support for decision tree
The ability to perform multiclass classification is a big advantage of decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using the WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost, which require instances to be weighted.

It handles the special case where the categorical variables cannot be ordered for multiclass classification, so the optimizations used to speed up binary classification cannot be directly applied to multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the ```2^(m-1) - 1``` categorical splits, provided that the number of splits is less than the maxBins provided in the input. This condition will not be met for features with a large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithms, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints.
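
The `2^(m-1) - 1` count can be checked by enumerating the splits directly. This is a small sketch rather than the code in this PR; `unordered_categorical_splits` is a hypothetical name, and pinning the first category to the right child is just one convenient way to avoid counting mirror images twice.

```python
from itertools import combinations


def unordered_categorical_splits(categories):
    """All ways to send a non-empty subset of categories to the left child.

    The first category is pinned to the right child, so each split is counted
    once instead of once per mirror image: 2^(m-1) - 1 splits for m categories.
    """
    first, rest = categories[0], categories[1:]
    splits = []
    for size in range(1, len(rest) + 1):
        for left in combinations(rest, size):
            right = [first] + [c for c in rest if c not in left]
            splits.append((list(left), right))
    return splits
```

For three categories this yields 3 splits and for five it yields 15, so the number of candidates doubles with every added category, which is why the maxBins condition fails quickly for high-cardinality features.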

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.

cc: mengxr, etrain, hirakendu, atalwalkar, srowen

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #886 from manishamde/multiclass and squashes the following commits:

26f8acc [Manish Amde] another attempt at fixing mima
c5b2d04 [Manish Amde] more MIMA fixes
1ce7212 [Manish Amde] change problem filter for mima
10fdd82 [Manish Amde] fixing MIMA excludes
e1c970d [Manish Amde] merged master
abf2901 [Manish Amde] adding classes to MimaExcludes.scala
45e767a [Manish Amde] adding developer api annotation for overriden methods
c8428c4 [Manish Amde] fixing weird multiline bug
afced16 [Manish Amde] removed label weights support
2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
4e85f2c [Manish Amde] minor: fixed scalastyle issues
b2ae41f [Manish Amde] minor: scalastyle
e4c1321 [Manish Amde] using while loop for regression histograms
d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
0fecd38 [Manish Amde] minor: add newline to EOF
2061cf5 [Manish Amde] merged from master
06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
9cc3e31 [Manish Amde] added implicit conversion import
5c1b2ca [Manish Amde] doc for PointConverter class
485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
3d7f911 [Manish Amde] updated doc
8e44ab8 [Manish Amde] updated doc
adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
e3e8843 [Manish Amde] minor code formatting
23d4268 [Manish Amde] minor: another minor code style
34ee7b9 [Manish Amde] minor: code style
237762d [Manish Amde] renaming functions
12e6d0a [Manish Amde] minor: removing line in doc
9a90c93 [Manish Amde] Merge branch 'master' into multiclass
1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
f5f6b83 [Manish Amde] multiclass for continous variables
8cfd3b6 [Manish Amde] working for categorical multiclass classification
828ff16 [Manish Amde] added categorical variable test
bce835f [Manish Amde] code cleanup
7e5f08c [Manish Amde] minor doc
1dd2735 [Manish Amde] bin search logic for multiclass
f16a9bb [Manish Amde] fixing while loop
d811425 [Manish Amde] multiclass bin aggregate logic
ab5cb21 [Manish Amde] multiclass logic
d8e4a11 [Manish Amde] sample weights
ed5a2df [Manish Amde] fixed classification requirements
d012be7 [Manish Amde] fixed while loop
18d2835 [Manish Amde] changing default values for num classes
6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
75f2bfc [Manish Amde] minor code style fix
e547151 [Manish Amde] minor modifications
34549d0 [Manish Amde] fixing error during merge
098e8c5 [Manish Amde] merged master
e006f9d [Manish Amde] changing variable names
5c78e1a [Manish Amde] added multiclass support
6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
46e06ee [Manish Amde] minor mods
3f85a17 [Manish Amde] tests for multiclass classification
4d5f70c [Manish Amde] added multiclass support for find splits bins
46f909c [Manish Amde] todo for multiclass support
455bea9 [Manish Amde] fixed tests
14aea48 [Manish Amde] changing instance format to weighted labeled point
a1a6e09 [Manish Amde] added weighted point class
968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-07-18 14:00:13 -07:00
Joseph K. Bradley 935fe65ff6 SPARK-1215 [MLLIB]: Clustering: Index out of bounds error (2)
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.  (Re-submitting PR after tangling commits in PR 1407 https://github.com/apache/spark/pull/1407 )
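
The failure mode is easy to reproduce in a stand-alone sketch of k-means++ seeding. The code below is illustrative Python, not `LocalKMeans.scala`; the fallback of reusing an existing point once no distinct point is left is an assumption of this sketch, and `kmeans_pp_init` and `squared_distance` are hypothetical names.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding that tolerates fewer distinct points than k."""
    rng = rng or random.Random(0)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Cost of a point = squared distance to its closest chosen center.
        costs = [min(squared_distance(p, c) for c in centers) for p in points]
        total = sum(costs)
        if total == 0.0:
            # Every point coincides with a chosen center, so sampling by cost
            # is impossible. Fallback (assumed here): duplicate a point.
            centers.append(points[0])
            continue
        # Sample the next center with probability proportional to its cost.
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = points[-1]
        for p, cost in zip(points, costs):
            cumulative += cost
            if cumulative >= threshold and cost > 0.0:
                chosen = p
                break
        centers.append(chosen)
    return centers
```

Without the zero-total check, the sampling loop has no positive-cost point to select once the distinct points run out, which is the situation the added check guards against.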

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
2014-07-17 15:05:02 -07:00
Alexander Ulanov 04b01bb101 [MLLIB] [SPARK-2222] Add multiclass evaluation metrics
Adding two classes:
1) MulticlassMetrics implements various multiclass evaluation metrics
2) MulticlassMetricsSuite implements unit tests for MulticlassMetrics
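
As a rough picture of what such metrics compute, the sketch below derives per-label precision and recall plus overall accuracy from (prediction, label) pairs. It is plain Python with hypothetical function names, not the `MulticlassMetrics` API.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count occurrences of each (prediction, label) pair."""
    return Counter(pairs)


def precision(counts, label):
    """Share of predictions of `label` that were correct."""
    predicted = sum(n for (pred, _), n in counts.items() if pred == label)
    return counts[(label, label)] / predicted if predicted else 0.0


def recall(counts, label):
    """Share of true instances of `label` that were predicted as such."""
    actual = sum(n for (_, lab), n in counts.items() if lab == label)
    return counts[(label, label)] / actual if actual else 0.0


def accuracy(counts):
    """Share of all predictions that match their label."""
    total = sum(counts.values())
    correct = sum(n for (pred, lab), n in counts.items() if pred == lab)
    return correct / total if total else 0.0


pairs = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (2, 1), (2, 2), (0, 2)]
counts = confusion_counts(pairs)
```

Here `counts` plays the role of a sparse confusion matrix: entry `(prediction, label)` holds how often that combination occurred.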

Author: Alexander Ulanov <nashb@yandex.ru>
Author: unknown <ulanov@ULANOV1.emea.hpqcorp.net>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1155 from avulanov/master and squashes the following commits:

2eae80f [Alexander Ulanov] Merge pull request #1 from mengxr/avulanov-master
5ebeb08 [Xiangrui Meng] minor updates
79c3555 [Alexander Ulanov] Addressing reviewers comments mengxr
0fa9511 [Alexander Ulanov] Addressing reviewers comments mengxr
f0dadc9 [Alexander Ulanov] Addressing reviewers comments mengxr
4811378 [Alexander Ulanov] Removing println
87fb11f [Alexander Ulanov] Addressing reviewers comments mengxr. Added confusion matrix
e3db569 [Alexander Ulanov] Addressing reviewers comments mengxr. Added true positive rate and false positive rate. Test suite code style.
a7e8bf0 [Alexander Ulanov] Addressing reviewers comments mengxr
c3a77ad [Alexander Ulanov] Addressing reviewers comments mengxr
e2c91c3 [Alexander Ulanov] Fixes to mutliclass metics
d5ce981 [unknown] Comments about Double
a5c8ba4 [unknown] Unit tests. Class rename
fcee82d [unknown] Unit tests. Class rename
d535d62 [unknown] Multiclass evaluation
2014-07-15 08:40:22 -07:00
DB Tsai 52beb20f79 [SPARK-2477][MLlib] Using appendBias for adding intercept in GeneralizedLinearAlgorithm
Instead of the prependOne currently used in GeneralizedLinearAlgorithm, we would like to use appendBias for 1) keeping the indices of the original training set unchanged by adding the intercept as the last element of the vector and 2) using the same public API for consistently adding the intercept.
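
The index argument is easiest to see side by side. The helpers below are hypothetical Python stand-ins operating on lists, not the MLlib functions of the same names.

```python
def prepend_one(features):
    """Old approach: the intercept term goes first, shifting every index by one."""
    return [1.0] + list(features)


def append_bias(features):
    """New approach: the intercept term goes last, so feature i stays at index i."""
    return list(features) + [1.0]


def predict(weights_with_intercept, features):
    """Linear prediction where the last weight is the intercept."""
    augmented = append_bias(features)
    return sum(w * x for w, x in zip(weights_with_intercept, augmented))
```

With `append_bias`, a weight vector learned on the augmented data lines up with the original feature indices, and the intercept is simply its last element.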

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1410 from dbtsai/SPARK-2477_intercept_with_appendBias and squashes the following commits:

011432c [DB Tsai] From Alpine Data Labs
2014-07-15 02:14:58 -07:00
Sean Owen 635888cbed SPARK-2363. Clean MLlib's sample data files
(Just made a PR for this, mengxr was the reporter of:)

MLlib has sample data under several folders:
1) data/mllib
2) data/
3) mllib/data/*
Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files.

Author: Sean Owen <sowen@cloudera.com>

Closes #1394 from srowen/SPARK-2363 and squashes the following commits:

54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/
2014-07-13 19:27:43 -07:00
Sandy Ryza 4c8be64e76 SPARK-2462. Make Vector.apply public.
Apologies if there's an already-discussed reason I missed for why this doesn't make sense.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1389 from sryza/sandy-spark-2462 and squashes the following commits:

2e5e201 [Sandy Ryza] SPARK-2462.  Make Vector.apply public.
2014-07-12 16:55:15 -07:00
Li Pu d38887b8a0 use specialized axpy in RowMatrix for SVD
After running some more tests on large matrices, found that the BV axpy (breeze/linalg/Vector.scala, axpy) is slower than the BSV axpy (breeze/linalg/operators/SparseVectorOps.scala, sv_dv_axpy), 8s vs. 2s for each multiplication. The BV axpy operates on an iterator while BSV axpy directly operates on the underlying array. I think the overhead comes from creating the iterator (with a zip) and advancing the pointers.
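
The two access patterns can be contrasted in a small sketch. This is illustrative Python on lists, not Breeze code; `axpy_dense` and `axpy_sparse` are hypothetical names, and the sketch shows only the shape of the two loops, not their relative speed on the JVM.

```python
def axpy_dense(a, x, y):
    """y += a * x by walking both vectors in lockstep through zip."""
    for i, (xi, yi) in enumerate(zip(x, y)):
        y[i] = yi + a * xi
    return y


def axpy_sparse(a, indices, values, y):
    """y += a * x where x is sparse, stored as parallel index and value arrays.

    Only the stored entries of x are visited, and y is updated by direct
    array indexing rather than through an iterator over every position.
    """
    for k in range(len(indices)):
        y[indices[k]] += a * values[k]
    return y
```

Both functions compute the same result; the specialized version touches only the non-zero entries of `x`.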

Author: Li Pu <lpu@twitter.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Li Pu <li.pu@outlook.com>

Closes #1378 from vrilleup/master and squashes the following commits:

6fb01a3 [Li Pu] use specialized axpy in RowMatrix
5255f2a [Li Pu] Merge remote-tracking branch 'upstream/master'
7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation when k = n
fe983b0 [Li Pu] improve scala style
96d2ecb [Li Pu] improve eigenvalue sorting
e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
2014-07-11 23:26:47 -07:00
DB Tsai 5596086935 [SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max
It basically moved the private ColumnStatisticsAggregator class from RowMatrix to a publicly available DeveloperApi with documentation and unit tests.

Changes:
1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
2) When creating an OnlineSummarizer object, the number of columns is not needed in the constructor. It's determined when users add the first sample.
3) Added the API documentation for MultivariateOnlineSummarizer.
4) Added the unit tests for MultivariateOnlineSummarizer.
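
A summarizer of this kind can be sketched with Welford's streaming update plus a pairwise merge. The class below is a hypothetical Python illustration, not `MultivariateOnlineSummarizer`; it mirrors only the behaviour described above, including fixing the number of columns on the first sample.

```python
import math


class OnlineSummarizer:
    """Streaming per-column mean, variance, min, and max (Welford's update)."""

    def __init__(self):
        self.count = 0
        self.mean = self.m2 = self.min = self.max = None

    def add(self, sample):
        if self.count == 0:
            # The number of columns is fixed by the first sample.
            n = len(sample)
            self.mean, self.m2 = [0.0] * n, [0.0] * n
            self.min, self.max = [math.inf] * n, [-math.inf] * n
        elif len(sample) != len(self.mean):
            raise ValueError("dimension mismatch")
        self.count += 1
        for j, x in enumerate(sample):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.count
            self.m2[j] += delta * (x - self.mean[j])
            self.min[j] = min(self.min[j], x)
            self.max[j] = max(self.max[j], x)
        return self

    def merge(self, other):
        """Fold in another summarizer, e.g. one built on a different partition."""
        if other.count == 0:
            return self
        if self.count == 0:
            self.count = other.count
            self.mean, self.m2 = list(other.mean), list(other.m2)
            self.min, self.max = list(other.min), list(other.max)
            return self
        if len(other.mean) != len(self.mean):
            raise ValueError("dimension mismatch")
        total = self.count + other.count
        for j in range(len(self.mean)):
            delta = other.mean[j] - self.mean[j]
            self.m2[j] += other.m2[j] + delta * delta * self.count * other.count / total
            self.mean[j] += delta * other.count / total
            self.min[j] = min(self.min[j], other.min[j])
            self.max[j] = max(self.max[j], other.max[j])
        self.count = total
        return self

    def variance(self):
        """Unbiased sample variance per column (0.0 with fewer than two samples)."""
        if self.count < 2:
            return [0.0 for _ in (self.mean or [])]
        return [m2 / (self.count - 1) for m2 in self.m2]
```

Because `merge` combines two partial summaries exactly, one summarizer can be built per partition and the results folded together.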

Author: DB Tsai <dbtsai@dbtsai.com>

Closes #955 from dbtsai/dbtsai-summarizer and squashes the following commits:

b13ac90 [DB Tsai] dbtsai-summarizer
2014-07-11 23:04:43 -07:00
Xiangrui Meng 2f59ce7dbe [SPARK-2358][MLLIB] Add an option to include native BLAS/LAPACK loader in the build
It would be easy for users to include the netlib-java jniloader in the spark jar, which is LGPL-licensed. We can follow the same approach as ganglia support in Spark, which could be enabled by turning on "-Pganglia-lgpl" at build time. We can use "-Pnetlib-lgpl" flag for this.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1295 from mengxr/netlib-lgpl and squashes the following commits:

aebf001 [Xiangrui Meng] add a profile to optionally include native BLAS/LAPACK loader in mllib
2014-07-10 21:57:54 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
The patch introduces the new way of working while retaining the existing ways of doing things.

For example, the build instruction for YARN in Maven is
`mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`
In sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
It also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Li Pu 1f33e1f201 SPARK-1782: svd for sparse matrix using ARPACK
copy ARPACK dsaupd/dseupd code from latest breeze
change RowMatrix to use sparse SVD
change tests for sparse SVD

All tests passed. I will run it against some large matrices.

Author: Li Pu <lpu@twitter.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Li Pu <li.pu@outlook.com>

Closes #964 from vrilleup/master and squashes the following commits:

7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation when k = n
fe983b0 [Li Pu] improve scala style
96d2ecb [Li Pu] improve eigenvalue sorting
e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
2014-07-09 12:15:08 -07:00
johnnywalleye d35e3db232 [SPARK-2417][MLlib] Fix DecisionTree tests
Fixes test failures introduced by https://github.com/apache/spark/pull/1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.
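
The expected value follows from simple arithmetic, assuming the node-level prediction before any split is the mean of the labels (the helper name below is hypothetical):

```python
def node_prediction(labels):
    """Prediction for an unsplit node, taken here as the mean label."""
    return sum(labels) / len(labels)


# 600 labels equal to 1.0 and 400 equal to 0.0, as described above.
labels = [1.0] * 600 + [0.0] * 400
prediction = node_prediction(labels)
```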

Author: johnnywalleye <jsondag@gmail.com>

Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
2014-07-09 11:06:34 -07:00
johnnywalleye 1114207cc8 [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160)
Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * (numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 *(numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.
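
The off-by-one can be reproduced in a simplified model with one statistic per bin instead of the interleaved layout above. This is a sketch of the reasoning, not the patched Scala code; both function names are hypothetical, and the "fixed" variant is one reading of what a correct suffix sum looks like.

```python
def right_node_aggregates_buggy(bin_data):
    """Mirrors the indexing quoted above: one bin is never added."""
    num_bins = len(bin_data)
    right = [0.0] * (num_bins - 1)
    right[num_bins - 2] = bin_data[num_bins - 1]
    for split_index in range(1, num_bins - 1):
        target = num_bins - 2 - split_index
        right[target] = bin_data[target] + right[target + 1]
    return right


def right_node_aggregates(bin_data):
    """right[s] = sum of all bins strictly to the right of split s."""
    num_bins = len(bin_data)
    right = [0.0] * (num_bins - 1)
    right[num_bins - 2] = bin_data[num_bins - 1]
    for split_index in range(1, num_bins - 1):
        target = num_bins - 2 - split_index
        # The bin just right of this split is target + 1, not target.
        right[target] = bin_data[target + 1] + right[target + 1]
    return right
```

With `bin_data = [1.0, 2.0, 3.0, 4.0]` the intended aggregates are `[9.0, 7.0, 4.0]`; the buggy indexing leaves out the bin holding `3.0` and folds in bins that belong to the left side instead.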

Author: johnnywalleye <jsondag@gmail.com>

Closes #1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
2014-07-08 19:17:26 -07:00
Sean Owen 2b36344f58 SPARK-1675. Make clear whether computePrincipalComponents requires centered data
Just closing out this small JIRA, resolving with a comment change.

Author: Sean Owen <sowen@cloudera.com>

Closes #1171 from srowen/SPARK-1675 and squashes the following commits:

45ee9b7 [Sean Owen] Add simple note that data need not be centered for computePrincipalComponents
2014-07-03 11:54:51 -07:00
Szul, Piotr 441cdcca64 [SPARK-2172] PySpark cannot import mllib modules in YARN-client mode
Include pyspark/mllib python sources as resources in the mllib.jar.
This way they will be included in the final assembly.

Author: Szul, Piotr <Piotr.Szul@csiro.au>

Closes #1223 from piotrszul/branch-1.0 and squashes the following commits:

69d5174 [Szul, Piotr] Removed unsed resource directory src/main/resource from mllib pom
f8c52a0 [Szul, Piotr] [SPARK-2172] PySpark cannot import mllib modules in YARN-client mode Include pyspark/mllib python sources as resources in the jar

(cherry picked from commit fa167194ce)
Signed-off-by: Reynold Xin <rxin@apache.org>
2014-06-25 23:07:16 -07:00
Gang Bai d484ddeff1 [SPARK-2163] class LBFGS optimize with Double tolerance instead of Int
https://issues.apache.org/jira/browse/SPARK-2163

This pull request includes the change for **[SPARK-2163]**:

* Changed the convergence tolerance parameter from type `Int` to type `Double`.
* Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`.
* Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`.
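
Why an integer tolerance is a problem can be seen with any relative-improvement stopping rule. The check below is a generic illustration, not the convergence criterion used by LBFGS or its underlying library; `first_converged_iteration` is a hypothetical helper.

```python
def first_converged_iteration(losses, tolerance):
    """Index of the first iteration whose relative improvement is below tolerance.

    Returns None if the stopping rule never fires.
    """
    for i in range(1, len(losses)):
        previous, current = losses[i - 1], losses[i]
        relative_change = abs(previous - current) / max(abs(previous), 1e-12)
        if relative_change < tolerance:
            return i
    return None


losses = [10.0, 5.0, 4.9999, 4.9999]

# A Double tolerance such as 1e-4 stops once progress becomes negligible.
with_double = first_converged_iteration(losses, 1e-4)

# Forcing the same value into an integer truncates it to 0, so the test
# "change < tolerance" can never succeed.
with_int = first_converged_iteration(losses, int(1e-4))
```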

This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation.

Author: Gang Bai <me@baigang.net>

Closes #1104 from BaiGang/fix_int_tol and squashes the following commits:

cecf02c [Gang Bai] Changed setConvergenceTol'' to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.
2014-06-20 08:52:20 -07:00
Doris Xin 566f70f214 Squishing a typo bug before it causes real harm
in updateNumRows method in RowMatrix

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1125 from dorx/updateNumRows and squashes the following commits:

8564aef [Doris Xin] Squishing a typo bug before it causes real harm
2014-06-18 22:19:06 -07:00
Shuo Xiang a6e0afdcf0 SPARK-2085: [MLlib] Apply user-specific regularization instead of uniform regularization in ALS
The current implementation of ALS takes a single regularization parameter and applies it to both the user factors and the product factors. This kind of regularization can be less effective when the number of users is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K products, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda.
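
The effect of scaling by the number of ratings shows up even in a rank-1 toy problem, where the regularized least-squares update for one user has a closed form. This is a hypothetical sketch, not the ALS code: `solve_user_factor` and its parameters are made up for illustration, and a rank of 1 is assumed so that no matrix solve is needed.

```python
def solve_user_factor(item_factors, ratings, lam, scale_by_ratings):
    """Rank-1 ridge update for one user.

    Minimizes sum_i (r_i - x * y_i)^2 + penalty * x^2 over x, where the
    penalty is either lam (uniform) or lam * #ratings (scaled).
    """
    penalty = lam * len(ratings) if scale_by_ratings else lam
    numerator = sum(y * r for y, r in zip(item_factors, ratings))
    denominator = sum(y * y for y in item_factors) + penalty
    return numerator / denominator


uniform = solve_user_factor([1.0, 1.0], [2.0, 2.0], 0.5, scale_by_ratings=False)
scaled = solve_user_factor([1.0, 1.0], [2.0, 2.0], 0.5, scale_by_ratings=True)
```

With the scaled penalty, the regularization term grows with the number of ratings just as the squared-error term does, so the balance between fit and shrinkage does not depend on how active a user is.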

Author: Shuo Xiang <sxiang@twitter.com>

Closes #1026 from coderxiang/als-reg and squashes the following commits:

93dfdb4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into als-reg
b98f19c [Shuo Xiang] merge latest master
52c7b58 [Shuo Xiang] Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
2014-06-12 17:37:06 -07:00
Tor Myklebust d9203350b0 [SPARK-1672][MLLIB] Separate user and product partitioning in ALS
Some clean up work following #593.

1. Allow setting different numbers of user blocks and product blocks in `ALS`.
2. Update `MovieLensALS` to reflect the change.

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:

0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
36420c7 [Xiangrui Meng] set exclusion rules for ALS
9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
d17a8bf [Xiangrui Meng] merge master
a4925fd [Tor Myklebust] Style.
bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
021f54b [Tor Myklebust] Separate user and product blocks.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-06-11 18:16:33 -07:00
witgo c48b6222ea Resolve scalatest warnings during build
Author: witgo <witgo@qq.com>

Closes #1032 from witgo/ShouldMatchers and squashes the following commits:

7ebf34c [witgo] Resolve scalatest warnings during build
2014-06-10 20:24:05 -07:00
Marcelo Vanzin 668cb1defe Remove compile-scoped junit dependency.
This avoids having junit classes showing up in the assembly jar.
I verified that only test classes in the jtransforms package
use junit.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits:

274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also.
ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
2014-06-05 13:13:33 -07:00
Takuya UESHIN 7c160293d6 [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:

e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
2014-06-05 11:27:33 -07:00
Xiangrui Meng 189df165bb [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points
We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:

1. dense vector: `[v0,v1,..]`
2. sparse vector: `(size,[i0,i1],[v0,v1])`
3. labeled point: `(label,vector)`

where "(..)" indicates a tuple and "[...]" indicates an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.

`MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.
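
The three proposed formats are simple enough to write and read back with a few helpers. This is a stand-alone Python sketch, not the MLlib or pyspark parser; the function names are hypothetical, and reading via `ast.literal_eval` is a convenience of this sketch that only handles finite numeric literals.

```python
import ast


def format_dense(values):
    """Dense vector: [v0,v1,..]"""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def format_sparse(size, indices, values):
    """Sparse vector: (size,[i0,i1],[v0,v1])"""
    index_text = ",".join(str(int(i)) for i in indices)
    value_text = ",".join(str(float(v)) for v in values)
    return "(%d,[%s],[%s])" % (size, index_text, value_text)


def format_labeled_point(label, vector_text):
    """Labeled point: (label,vector), where vector is already formatted."""
    return "(%s,%s)" % (float(label), vector_text)


def parse(text):
    """Parse any of the three formats into nested Python tuples and lists."""
    return ast.literal_eval(text)
```

A tuple result signals a sparse vector or a labeled point and a list signals a dense vector, matching the "(..)" and "[...]" convention in the description.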

CC: @mateiz, @srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #685 from mengxr/labeled-io and squashes the following commits:

2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
56746ea [Xiangrui Meng] replace # by .
623a5f0 [Xiangrui Meng] merge master
f06d5ba [Xiangrui Meng] add docs and minor updates
640fe0c [Xiangrui Meng] throw SparkException
5bcfbc4 [Xiangrui Meng] update test to add scientific notations
e86bf38 [Xiangrui Meng] remove NumericTokenizer
050fca4 [Xiangrui Meng] use StringTokenizer
6155b75 [Xiangrui Meng] merge master
f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
aea4ae3 [Xiangrui Meng] minor updates
810d6df [Xiangrui Meng] update tokenizer/parser implementation
7aac03a [Xiangrui Meng] remove Scala parsers
c1885c1 [Xiangrui Meng] add headers and minor changes
b0c50cb [Xiangrui Meng] add customized parser
d731817 [Xiangrui Meng] style update
63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
2014-06-04 12:56:56 -07:00
Neville Li b8d2580039 [MLLIB] set RDD names in ALS
This is very useful when debugging & fine tuning jobs with large data sets.

Author: Neville Li <neville@spotify.com>

Closes #966 from nevillelyh/master and squashes the following commits:

6747764 [Neville Li] [MLLIB] use string interpolation for RDD names
3b15d34 [Neville Li] [MLLIB] set RDD names in ALS
2014-06-04 01:51:34 -07:00
DB Tsai f4dd665c85 Fixed a typo
in RowMatrix.scala

Author: DB Tsai <dbtsai@dbtsai.com>

Closes #959 from dbtsai/dbtsai-typo and squashes the following commits:

fab0e0e [DB Tsai] Fixed typo
2014-06-03 18:10:58 -07:00
Syed Hashmi 7782a304ad [SPARK-1942] Stop clearing spark.driver.port in unit tests
stop resetting spark.driver.port in unit tests (scala, java and python).

Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>

Closes #943 from syedhashmi/master and squashes the following commits:

885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
2014-06-03 12:04:47 -07:00
Tor Myklebust 9a5d482e09 [SPARK-1553] Alternating nonnegative least-squares
This pull request includes a nonnegative least-squares solver (NNLS) tailored to the kinds of small-scale problems that come up when training matrix factorisation models by alternating nonnegative least-squares (ANNLS).

The method used for the NNLS subproblems is based on the classical method of projected gradients.  There is a modification where, if the set of active constraints has not changed since the last iteration, a conjugate gradient step is considered and possibly rejected in favour of the gradient; this improves convergence once the optimal face has been located.

The NNLS solver is in `org.apache.spark.mllib.optimization.NNLSbyPCG`.
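
As a rough illustration of the projected-gradient idea described above, here is a minimal Python sketch. It is not the `NNLSbyPCG` implementation: it uses a fixed step size, omits the conjugate-gradient step, and the function name `nnls_projected_gradient`, the `step` value, and the iteration count are assumptions made for this example.

```python
def nnls_projected_gradient(ata, atb, step, iterations=500):
    """Minimise 0.5 * x^T (A^T A) x - (A^T b)^T x subject to x >= 0.

    ata is A^T A as a list of rows and atb is A^T b (hypothetical inputs).
    """
    n = len(atb)
    x = [0.0] * n
    for _ in range(iterations):
        # Gradient of the quadratic objective: (A^T A) x - A^T b.
        grad = [sum(ata[i][j] * x[j] for j in range(n)) - atb[i] for i in range(n)]
        # Gradient step, then projection onto the nonnegative orthant.
        x = [max(0.0, x[i] - step * grad[i]) for i in range(n)]
    return x


# A is the identity and b = (1, -1): the unconstrained optimum (1, -1) is
# infeasible, so the second coordinate is clamped to zero.
solution = nnls_projected_gradient([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], step=0.5)
```

The projection is what keeps every iterate feasible; the conjugate-gradient refinement mentioned in the description would only be attempted when the set of clamped coordinates stops changing.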

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #460 from tmyklebu/annls and squashes the following commits:

79bc4b5 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into annls
199b0bc [Tor Myklebust] Make the ctor private again and use the builder pattern.
7fbabf1 [Tor Myklebust] Cleanup matrix math in NNLSSuite.
65ef7f2 [Tor Myklebust] Make ALS's ctor public and remove a couple of "convenience" wrappers.
2d4f3cb [Tor Myklebust] Cleanup.
0cb4481 [Tor Myklebust] Drop the iteration limit from 40k to max(400,20n).
e2a01d1 [Tor Myklebust] Create a workspace object for NNLS to cut down on memory allocations.
b285106 [Tor Myklebust] Clean up NNLS test cases.
9c820b6 [Tor Myklebust] Tweak variable names.
8a1a436 [Tor Myklebust] Describe the problem and add a reference to Polyak's paper.
5345402 [Tor Myklebust] Style fixes that got eaten.
ac673bd [Tor Myklebust] More safeguards against numerical ridiculousness.
c288b6a [Tor Myklebust] Finish moving the NNLS solver.
9a82fa6 [Tor Myklebust] Fix scalastyle moanings.
33bf4f2 [Tor Myklebust] Fix missing space.
89ea0a8 [Tor Myklebust] Hack ALSSuite to support NNLS testing.
f5dbf4d [Tor Myklebust] Teach ALS how to use the NNLS solver.
6cb563c [Tor Myklebust] Tests for the nonnegative least squares solver.
a68ac10 [Tor Myklebust] A nonnegative least-squares solver.
2014-06-02 11:48:09 -07:00
zsxwing cb7fe50348 SPARK-1925: Replace '&' with '&&'
JIRA: https://issues.apache.org/jira/browse/SPARK-1925

Author: zsxwing <zsxwing@gmail.com>

Closes #879 from zsxwing/SPARK-1925 and squashes the following commits:

5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
2014-05-26 14:34:58 -07:00
baishuo(白硕) a08262d876 Update LBFGSSuite.scala
the same reason as https://github.com/apache/spark/pull/588

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #815 from baishuo/master and squashes the following commits:

6876c1e [baishuo(白硕)] Update LBFGSSuite.scala
2014-05-23 13:02:40 -07:00
Xiangrui Meng d52761d67f [SPARK-1741][MLLIB] add predict(JavaRDD) to RegressionModel, ClassificationModel, and KMeans
`model.predict` returns an RDD of a Scala primitive type (Int/Double), which Java sees as Object. Adding predict(JavaRDD) makes life easier for Java users.

Added tests for KMeans, LinearRegression, and NaiveBayes.

Will update examples after https://github.com/apache/spark/pull/653 gets merged.

cc: @srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #670 from mengxr/predict-javardd and squashes the following commits:

b77ccd8 [Xiangrui Meng] Merge branch 'master' into predict-javardd
43caac9 [Xiangrui Meng] add predict(JavaRDD) to RegressionModel, ClassificationModel, and KMeans
2014-05-15 11:59:59 -07:00
Prashant Sharma 46324279da Package docs
These are a few changes based on the original patch by @scrapcodes.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #785 from pwendell/package-docs and squashes the following commits:

c32b731 [Patrick Wendell] Changes based on Prashant's patch
c0463d3 [Prashant Sharma] added eof new line
ce8bf73 [Prashant Sharma] Added eof new line to all files.
4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
2014-05-14 22:24:41 -07:00
Xiangrui Meng e3d72a74ad [SPARK-1696][MLLIB] use alpha in dense dspr
It doesn't affect existing code because only `alpha = 1.0` is used in the code.
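
For reference, a minimal Python sketch of the update that BLAS `dspr` is meant to perform, `A := alpha * x * x^T + A`, on an upper-triangular packed matrix. The helper name `dspr_upper` is an assumption for this example and is not MLlib's code; it only shows where `alpha` enters, and why dropping it goes unnoticed when `alpha = 1.0`.

```python
def dspr_upper(alpha, x, packed):
    """In-place symmetric rank-1 update A := alpha * x * x^T + A.

    `packed` holds the upper triangle of A column by column
    (the BLAS 'U' packed layout).
    """
    n = len(x)
    k = 0
    for j in range(n):
        for i in range(j + 1):
            packed[k] += alpha * x[i] * x[j]
            k += 1
    return packed


a = [0.0, 0.0, 0.0]  # upper triangle of a 2x2 zero matrix
dspr_upper(2.0, [1.0, 3.0], a)
```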

Author: Xiangrui Meng <meng@databricks.com>

Closes #778 from mengxr/mllib-dspr-fix and squashes the following commits:

a37402e [Xiangrui Meng] use alpha in dense dspr
2014-05-14 17:18:30 -07:00
Andrew Tulloch d1e487473f SPARK-1791 - SVM implementation does not use threshold parameter
Summary:
https://issues.apache.org/jira/browse/SPARK-1791

Simple fix, and backward compatible, since

- anyone who set the threshold was getting completely wrong answers.
- anyone who did not set the threshold had the default 0.0 value for the threshold anyway.
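
A minimal sketch of the intended behaviour, in Python for brevity. The function `svm_predict` and its tie handling are assumptions for illustration, not the MLlib source: the point is only that the margin is compared against the configured threshold rather than against a hard-coded zero.

```python
def svm_predict(weights, intercept, features, threshold=0.0):
    """Return 1.0 when the margin reaches the threshold, else 0.0."""
    margin = sum(w * f for w, f in zip(weights, features)) + intercept
    # Assumption: a margin equal to the threshold counts as positive.
    return 0.0 if margin < threshold else 1.0


default_label = svm_predict([1.0], 0.0, [0.5])                 # margin 0.5, threshold 0.0
strict_label = svm_predict([1.0], 0.0, [0.5], threshold=1.0)   # margin 0.5, threshold 1.0
```

With the default threshold the two code paths agree, which is why callers who never set a threshold see no change.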

Test Plan:
Unit test added that is verified to fail under the old implementation,
and pass under the new implementation.

Reviewers:

CC:

Author: Andrew Tulloch <andrew@tullo.ch>

Closes #725 from ajtulloch/SPARK-1791-SVM and squashes the following commits:

770f55d [Andrew Tulloch] SPARK-1791 - SVM implementation does not use threshold parameter
2014-05-13 17:31:27 -07:00
Sean Owen 7120a2979d SPARK-1798. Tests should clean up temp files
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.

The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.
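
The same cleanup pattern, sketched in Python as an analogy only. `atexit.register` stands in for `deleteOnExit()` and `shutil.rmtree` for `Utils.deleteRecursively`; the helper names `run_with_temp_dir` and `write_file` are hypothetical.

```python
import atexit
import os
import shutil
import tempfile


def run_with_temp_dir(body):
    """Create a temp directory, run body(path), and always remove the directory."""
    path = tempfile.mkdtemp(prefix="spark-test-")
    # Safety net, similar in spirit to deleteOnExit(): clean up at interpreter exit.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    try:
        return body(path)
    finally:
        # Explicit recursive delete, the analogue of an @After cleanup method.
        shutil.rmtree(path, ignore_errors=True)


def write_file(path):
    with open(os.path.join(path, "part-00000"), "w") as handle:
        handle.write("1.0 2.0\n")
    return path


used_path = run_with_temp_dir(write_file)
```

Registering the exit hook at creation time covers tests that crash before reaching their cleanup step, while the explicit delete keeps long test runs from accumulating directories.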

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

Author: Sean Owen <sowen@cloudera.com>

Closes #732 from srowen/SPARK-1798 and squashes the following commits:

5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
2014-05-12 14:16:19 -07:00
Funes 191279ce4e Bug fix of sparse vector conversion
Fixed a small bug caused by the inconsistency of index/data array size and vector length.
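
A hedged Python sketch of this kind of mismatch: a vector builder may hand back index and data arrays that are longer than the number of entries actually in use, so a conversion has to trim them. The function `from_builder_arrays` and its `used` parameter are assumptions for this example, not the patched Scala code.

```python
def from_builder_arrays(size, index, data, used):
    """Return (size, indices, values), keeping only the first `used` entries.

    The arrays are copied only when they are longer than `used`.
    """
    if len(index) == used:
        return size, index, data
    return size, index[:used], data[:used]


# Backing arrays have capacity 4 but only 2 active entries.
converted = from_builder_arrays(5, [0, 3, 0, 0], [1.0, 2.0, 0.0, 0.0], used=2)
```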

Author: Funes <tianshaocun@gmail.com>
Author: funes <tianshaocun@gmail.com>

Closes #661 from funes/bugfix and squashes the following commits:

edb2b9d [funes] remove unused import
75dced3 [Funes] update test case
d129a66 [Funes] Add test for sparse breeze by vector builder
64e7198 [Funes] Copy data only when necessary
b85806c [Funes] Bug fix of sparse vector conversion
2014-05-08 17:54:10 -07:00
DB Tsai 910a13b3c5 [SPARK-1157][MLlib] Bug fix: lossHistory should exclude rejection steps, and remove miniBatch
Get the lossHistory from Breeze's API, which already excludes the rejection steps in line search. Also remove the miniBatch option from LBFGS: quasi-Newton methods approximate the inverse of the Hessian from successive gradients, which does not make sense if the gradients are computed from a varying objective.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #582 from dbtsai/dbtsai-lbfgs-bug and squashes the following commits:

9cc6cf9 [DB Tsai] Removed the miniBatch in LBFGS.
1ba6a33 [DB Tsai] Formatting the code.
d72c679 [DB Tsai] Using Breeze's states to get the loss.
2014-05-08 17:53:22 -07:00
Manish Amde f269b016ac SPARK-1544 Add support for deep decision trees.
@etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.

To summarize:
1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
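
The two quantities above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual calculation: `bytes_per_node` is a hypothetical per-node cost, the tree is assumed binary (level `d` has `2**d` nodes), and the names `plan_levels` and `passes_for_level` are invented for this example.

```python
def plan_levels(max_memory_bytes, bytes_per_node):
    """Return (deepest level trainable in one pass, group size for deeper levels)."""
    nodes_in_parallel = max(1, max_memory_bytes // bytes_per_node)
    # Largest level d with 2**d <= nodes_in_parallel.
    max_level = nodes_in_parallel.bit_length() - 1
    return max_level, nodes_in_parallel


def passes_for_level(level, group_size):
    """Number of passes over the data needed for one level (ceiling division)."""
    nodes_at_level = 1 << level
    return -(-nodes_at_level // group_size)


max_level, group_size = plan_levels(max_memory_bytes=1000, bytes_per_node=100)
deep_passes = passes_for_level(5, group_size)      # 32 nodes in groups of 10
shallow_passes = passes_for_level(2, group_size)   # 4 nodes fit in one group
```

Levels up to `max_level` need one pass each; beyond that, the pass count grows with the number of groups, which is the cost the description refers to.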

cc: @atalwalkar, @hirakendu, @mengxr

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #475 from manishamde/deep_tree and squashes the following commits:

968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-05-07 17:08:38 -07:00
baishuo(白硕) 0c19bb161b Update GradientDescentSuite.scala
Use a faster way to construct an array.

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #588 from baishuo/master and squashes the following commits:

45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala
c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala
b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala
2014-05-07 16:02:55 -07:00
Sean Owen 25ad8f9301 SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs
While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.

Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
2014-05-06 20:07:22 -07:00
Xiangrui Meng 98750a74da [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide
Final pass before the v1.0 release.

* Remove `VectorRDDs`
* Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
* Change the default value of `addIntercept` to false and allow adding an intercept in Ridge and Lasso.
* Clean `DecisionTree` package doc and test suite.
* Mark model constructors `private[spark]`
* Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
* Add `saveAsLibSVMFile`.
* Add `appendBias` to `MLUtils`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #524 from mengxr/mllib-cleaning and squashes the following commits:

295dc8b [Xiangrui Meng] update loadLibSVMFile doc
1977ac1 [Xiangrui Meng] fix doc of appendBias
649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
54b812c [Xiangrui Meng] add appendBias
a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
9b02b93 [Xiangrui Meng] minor code style update
a593ddc [Xiangrui Meng] fix python tests
fc28c18 [Xiangrui Meng] mark more classes experimental
f6cbbff [Xiangrui Meng] fix Java tests
0af70b0 [Xiangrui Meng] minor
6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
c81807f [Xiangrui Meng] set the default value of AddIntercept to false
03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
c66c56f [Xiangrui Meng] move tree md to package object doc
a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
2014-05-05 18:32:54 -07:00
Tor Myklebust 5c0cd5c1a5 [SPARK-1646] Micro-optimisation of ALS
This change replaces some Scala `for` and `foreach` constructs with `while` constructs.  There may be a slight performance gain on the order of 1-2% when training an ALS model.

I trained an ALS model on the Movielens 10M-rating dataset repeatedly both with and without these changes.  All 7 runs in both columns were done in a Scala `for` loop like this:

    for (iter <- 0 to 10) {
      val before = System.currentTimeMillis()
      val model = ALS.train(rats, 20, 10)
      val after = System.currentTimeMillis()
      println("%d ms".format(after-before))
      println("rmse %g".format(computeRmse(model, rats, numRatings)))
    }

The timings were done on a multiuser machine, and I stopped one set of timings after 7 had been completed.  It would be nice if somebody with dedicated hardware could confirm my timings.

    After           Before
    121980 ms       122041 ms
    117069 ms       117127 ms
    115332 ms       117523 ms
    115381 ms       117402 ms
    114635 ms       116550 ms
    114140 ms       114076 ms
    112993 ms       117200 ms

Ratios are about 1.0005, 1.0005, 1.019, 1.0175, 1.01671, 0.99944, and 1.03723.  I therefore suspect these changes make for a slight performance gain on the order of 1-2%.
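
The quoted ratios can be reproduced from the table as Before / After. The short Python check below uses only the timings listed above; the variable names are chosen for this example.

```python
after_ms = [121980, 117069, 115332, 115381, 114635, 114140, 112993]
before_ms = [122041, 117127, 117523, 117402, 116550, 114076, 117200]

# Ratio > 1 means the run with the change was faster than the run without it.
ratios = [before / after for after, before in zip(after_ms, before_ms)]
mean_ratio = sum(ratios) / len(ratios)
```

The mean ratio comes out at roughly 1.013, consistent with the stated 1-2% estimate, though with seven runs on a shared machine the spread between runs is of the same order as the effect.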

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #568 from tmyklebu/alsopt and squashes the following commits:

5ded80f [Tor Myklebust] Fix style.
79595ff [Tor Myklebust] Fix style error.
4ef0313 [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsopt
114fb74 [Tor Myklebust] Turn some 'for' loops into 'while' loops.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-04-29 22:04:34 -07:00
Xiangrui Meng 3f38334f44 [SPARK-1636][MLLIB] Move main methods to examples
* `NaiveBayes` -> `SparseNaiveBayes`
* `KMeans` -> `DenseKMeans`
* `SVMWithSGD` and `LogisticRegressionWithSGD` -> `BinaryClassification`
* `ALS` -> `MovieLensALS`
* `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
* `DecisionTree` -> `DecisionTreeRunner`

`scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`.

Example help message:

~~~
BinaryClassification: an example app for binary classification.
Usage: BinaryClassification [options] <input>

  --numIterations <value>
        number of iterations
  --stepSize <value>
        initial step size, default: 1.0
  --algorithm <value>
        algorithm (SVM,LR), default: LR
  --regType <value>
        regularization type (L1,L2), default: L2
  --regParam <value>
        regularization parameter, default: 0.1
  <input>
        input paths to labeled examples in LIBSVM format
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #584 from mengxr/mllib-main and squashes the following commits:

7b58c60 [Xiangrui Meng] minor
6e35d7e [Xiangrui Meng] make imports explicit and fix code style
c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit
6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner
be86069 [Xiangrui Meng] use main instead of extending App
b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples
8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params
fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example
67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example
b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example
b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params
577945b [Xiangrui Meng] remove unused imports from NB
3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification
f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app
01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main
9420692 [Xiangrui Meng] add scopt to examples dependencies
2014-04-29 00:41:03 -07:00
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, Reformat sql/catalyst/pom.xml, sql/hive/pom.xml, sql/core/pom.xml (four-space indentation changed to two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Sandeep bb68f47745 [Fix #79] Replace Breakable For Loops By While Loops
Author: Sandeep <sandeep@techaddict.me>

Closes #503 from techaddict/fix-79 and squashes the following commits:

e3f6746 [Sandeep] Style changes
07a4f6b [Sandeep] for loop to While loop
0a6d8e9 [Sandeep] Breakable for loop to While loop
2014-04-23 22:47:59 -07:00
Xiangrui Meng 26d35f3fd9 [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng <meng@databricks.com>

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
2014-04-22 11:20:47 -07:00
Tor Myklebust bf9d49b6d1 [SPARK-1281] Improve partitioning in ALS
ALS was using HashPartitioner and explicit uses of `%` together.  Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way.

This pull request:
1) Makes the Partitioner an instance variable of ALS.
2) Replaces the direct uses of `%` with calls to a Partitioner.
3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets.
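
A small Python sketch of the problem and of the scrambling idea. The mixing function below is a murmur3-style 32-bit finalizer picked purely for illustration; the PR text does not say which bit-mixing scheme it uses, and the names `scramble`, `naive_buckets`, and `scrambled_buckets` are assumptions.

```python
def scramble(h):
    """Mix the bits of a 32-bit hash (murmur3-style finalizer, illustrative choice)."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


num_partitions = 10
# Ids in an arithmetic progression whose stride equals the partition count.
ids = range(0, 1000, 10)

# Plain modulo sends every id to the same bucket.
naive_buckets = {i % num_partitions for i in ids}
# Scrambling first spreads the same ids over several buckets.
scrambled_buckets = {scramble(i) % num_partitions for i in ids}
```

With plain `%`, one partition would receive all of these users or products and the rest would sit idle, which is the unfair bucketing described above.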

This pull request does not make the partitioner user-configurable.

I'm not all that happy about the way I did (1).  It introduces an icky lifetime issue and dances around it by nulling something.  However, I don't know a better way to make the partitioner visible everywhere it needs to be visible.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #407 from tmyklebu/master and squashes the following commits:

dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-04-22 11:07:30 -07:00
Andrew Or b3e5366f69 [Fix #274] Document + fix annotation usages
... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code.

In addition, this PR
(1) removes unnecessary `:: * ::` tags,
(2) adds missing `:: * ::` tags, and
(3) removes annotations for internal APIs.

Author: Andrew Or <andrewor14@gmail.com>

Closes #470 from andrewor14/annotations-fix and squashes the following commits:

92a7f42 [Andrew Or] Document + fix annotation usages
2014-04-21 22:24:44 -07:00
Tor Myklebust 25fc31884b [SPARK-1535] ALS: Avoid the garbage-creating ctor of DoubleMatrix
`new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away.  This pull request avoids that constructor in the ALS code.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #442 from tmyklebu/foo2 and squashes the following commits:

2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate.
a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's.
2014-04-19 15:10:18 -07:00
Sean Owen 8aa1f4c4f6 SPARK-1357 (addendum). More Experimental items in MLlib
Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much.

Author: Sean Owen <sowen@cloudera.com>

Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits:

17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::"
6800e4c [Sean Owen] Remove blank line after ":: Experimental ::"
b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0
2014-04-18 10:04:02 -07:00
CodingCat e31c8ffca6 SPARK-1483: Rename minSplits to minPartitions in public APIs
https://issues.apache.org/jira/browse/SPARK-1483

From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz

Author: CodingCat <zhunansjtu@gmail.com>

Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:

4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
2014-04-18 10:01:16 -07:00
Holden Karau c3527a333a SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold to MLUtils & fixes bug in BernoulliSampler]
Author: Holden Karau <holden@pigscanfly.ca>

Closes #18 from holdenk/addkfoldcrossvalidation and squashes the following commits:

208db9b [Holden Karau] Fix a bad space
e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead
6ddbf05 [Holden Karau] swap training and validation order
7157ae9 [Holden Karau] CR feedback
90896c7 [Holden Karau] New line
150889c [Holden Karau] Fix up error messages in the MLUtilsSuite
2cb90b3 [Holden Karau] Fix the names in kFold
c702a96 [Holden Karau] Fix imports in MLUtils
e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala
c5b723f [Holden Karau] clean up
7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line
bb5fa56 [Holden Karau] extra line sadness
163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds
5a33f1d [Holden Karau] Code review follow up.
e8741a7 [Holden Karau] CR feedback
b78804e [Holden Karau] Remove cross validation [TODO in another pull request]
91eae64 [Holden Karau] Consolidate things in mlutils
264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param
dd0b737 [Holden Karau] Wrap long lines (oops)
c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD
08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement
a751ec6 [Holden Karau] Add k-fold cross validation to MLLib
2014-04-16 09:33:27 -07:00
Matei Zaharia 63ca581d9c [WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.

On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.

Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
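
To make the data structure concrete, here is a minimal stand-in for a sparse vector in plain Python. It is not PySpark's `SparseVector`: the class name `TinySparseVector` and its methods are hypothetical, and it only shows the representation (a size plus sorted index and value arrays) and initialization from a dict.

```python
class TinySparseVector:
    """Sparse vector stored as a size plus parallel index/value lists."""

    def __init__(self, size, entries):
        self.size = size
        pairs = sorted(entries.items())  # keep indices in ascending order
        self.indices = [index for index, _ in pairs]
        self.values = [float(value) for _, value in pairs]

    def dot(self, dense):
        """Dot product with a dense list of the same length."""
        if len(dense) != self.size:
            raise ValueError("dimension mismatch")
        return sum(value * dense[index] for index, value in zip(self.indices, self.values))


vector = TinySparseVector(4, {3: 2.0, 1: 1.0})
product = vector.dot([1.0, 2.0, 3.0, 4.0])
```

Only the non-zero entries are touched in `dot`, which is the reason to keep vectors sparse end to end instead of densifying them on the way into a model.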

CC @mengxr, @joshrosen

Author: Matei Zaharia <matei@databricks.com>

Closes #341 from mateiz/py-ml-update and squashes the following commits:

d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 20:33:24 -07:00
Manish Amde 07d72fe696 Decision Tree documentation for MLlib programming guide
Added documentation for user to use the decision tree algorithms for classification and regression in Spark 1.0 release.

Apart from a general review, I need specific input on the following:
* I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
* I have not looked closely at pyspark, but I am wondering whether new mllib algorithms are automatically plugged in or whether we need to do some extra work to call mllib functions from pyspark. I will add to the pyspark examples based on the advice I get.

cc: @mengxr, @hirakendu, @etrain, @atalwalkar

Author: Manish Amde <manish9ue@gmail.com>

Closes #402 from manishamde/tree_doc and squashes the following commits:

022485a [Manish Amde] more documentation
865826e [Manish Amde] minor: grammar
dbb0e5e [Manish Amde] minor improvements to text
b9ef6c4 [Manish Amde] basic decision tree code examples
6e297d7 [Manish Amde] added subsections
f427e84 [Manish Amde] renaming sections
9c0c4be [Manish Amde] split candidate
6925275 [Manish Amde] impurity and information gain
94fd2f9 [Manish Amde] more reorg
b93125c [Manish Amde] more subsection reorg
3ecb2ad [Manish Amde] minor text addition
1537dd3 [Manish Amde] added placeholders and some doc
d06511d [Manish Amde] basic skeleton
2014-04-15 11:14:28 -07:00
DB Tsai 6843d637e7 [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.
This PR uses Breeze's L-BFGS implementation; the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr !

When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularization part of the cost function). With the current updater design, we can compute those two values in the following way.

Let's review how the updater works when returning newWeights given the input parameters.

w' = w - thisIterStepSize * (gradient + regGradient(w))  Note that regGradient is a function of w!
If we set gradient = 0, thisIterStepSize = 1, then
regGradient(w) = w - w'

As a result, regVal can be computed by

    val regVal = updater.compute(
      weights,
      new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
and regGradient can be obtained by

      val regGradient = weights.sub(
        updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
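
The same trick can be sketched in plain Python. The toy updater below is an assumption made for illustration (an L2 penalty with `thisIterStepSize = stepSize / sqrt(iter)`); it is not the MLlib `Updater` itself, and the function names are hypothetical.

```python
import math


def l2_updater(weights, gradient, step_size, iteration, reg_param):
    """Toy L2 updater returning (new_weights, reg_val), like the (_1, _2) pair above."""
    this_iter_step = step_size / math.sqrt(iteration)
    new_weights = [
        w - this_iter_step * (g + reg_param * w)
        for w, g in zip(weights, gradient)
    ]
    reg_val = 0.5 * reg_param * sum(w * w for w in new_weights)
    return new_weights, reg_val


def reg_gradient(weights, reg_param):
    # Zero loss gradient, step size 1, iteration 1  =>  w' = w - regGradient(w).
    zeros = [0.0] * len(weights)
    new_weights, _ = l2_updater(weights, zeros, 1.0, 1, reg_param)
    return [w - w_new for w, w_new in zip(weights, new_weights)]


def reg_value(weights, reg_param):
    # Step size 0 leaves the weights untouched, so the second result is regVal at w.
    zeros = [0.0] * len(weights)
    _, reg_val = l2_updater(weights, zeros, 0.0, 1, reg_param)
    return reg_val
```

For this toy L2 updater, `reg_gradient(w, regParam)` recovers `regParam * w` without the optimizer needing to know which regularizer is in use.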

The PR includes tests that compare the results with SGD, with and without regularization.

We did a comparison between LBFGS and SGD, and often we saw 10x fewer
steps in LBFGS while the cost per step is the same (just computing
the gradient).

The following is the paper by Prof. Ng at Stanford comparing different
optimizers including LBFGS and SGD. They use them in the context of
deep learning, but it is worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits:

984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
2014-04-15 11:12:47 -07:00
Sean Owen 0247b5c546 SPARK-1488. Resolve scalac feature warnings during build
For your consideration: scalac currently notes a number of feature warnings during compilation:

```
[warn] there were 65 feature warning(s); re-run with -feature for details
```

Warnings are like:

```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn]                ^
```

scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.

This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.

Author: Sean Owen <sowen@cloudera.com>

Closes #404 from srowen/SPARK-1488 and squashes the following commits:

8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
2014-04-14 19:50:00 -07:00
Xusen Yin fdfb45e691 [WIP] [SPARK-1328] Add vector statistics
With the new vector system in MLlib, it is good to add some new APIs to process `RDD[Vector]`. Besides, the former implementation of `computeStat` is not numerically stable: it can lose precision and may produce `NaN` in scientific computing, as described in [SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
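
One common way to get a stable single-pass mean and variance is Welford's update, with a pairwise merge for combining per-partition results. The sketch below is plain Python with hypothetical names; it illustrates the technique and is not the code in this PR.

```python
class RunningStats:
    """Single-pass mean/variance with a merge step for combining partitions."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        # Welford update: avoids sum(x^2) - n * mean^2, which cancels badly.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        # Combine two partial summaries with the pairwise (Chan et al.) formula.
        if other.count == 0:
            return self
        if self.count == 0:
            self.count, self.mean, self.m2 = other.count, other.mean, other.m2
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    def variance(self):
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0


def summarize(values):
    stats = RunningStats()
    for v in values:
        stats.add(v)
    return stats
```

Here `add` plays the role of a per-element sequence operation and `merge` the role of a combine operation across partitions.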

APIs contain:

* rowMeans(): RDD[Double]
* rowNorm2(): RDD[Double]
* rowSDs(): RDD[Double]
* colMeans(): Vector
* colMeans(size: Int): Vector
* colNorm2(): Vector
* colNorm2(size: Int): Vector
* colSDs(): Vector
* colSDs(size: Int): Vector
* maxOption((Vector, Vector) => Boolean): Option[Vector]
* minOption((Vector, Vector) => Boolean): Option[Vector]
* rowShrink(): RDD[Vector]
* colShrink(): RDD[Vector]

This is a work in progress, and some more APIs will be added to `LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to `MLContext` later.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #268 from yinxusen/vector-statistics and squashes the following commits:

d61363f [Xusen Yin] rebase to latest master
16ae684 [Xusen Yin] fix minor error and remove useless method
10cf5d3 [Xusen Yin] refine some return type
b064714 [Xusen Yin] remove computeStat in MLUtils
cbbefdb [Xiangrui Meng] update multivariate statistical summary interface and clean tests
4eaf28a [Xusen Yin] merge VectorRDDStatistics into RowMatrix
48ee053 [Xusen Yin] fix minor error
e624f93 [Xusen Yin] fix scala style error
1fba230 [Xusen Yin] merge while loop together
69e1f37 [Xusen Yin] remove lazy eval, and minor memory footprint
548e9de [Xusen Yin] minor revision
86522c4 [Xusen Yin] add comments on functions
dc77e38 [Xusen Yin] test sparse vector RDD
18cf072 [Xusen Yin] change def to lazy val to make sure that the computations in function be evaluated only once
f7a3ca2 [Xusen Yin] fix the corner case of maxmin
967d041 [Xusen Yin] full revision with Aggregator class
138300c [Xusen Yin] add new Aggregator class
1376ff4 [Xusen Yin] rename variables and adjust code
4a5c38d [Xusen Yin] add scala doc, refine code and comments
036b7a5 [Xusen Yin] fix the bug of Nan occur
f6e8e9a [Xusen Yin] add sparse vectors test
4cfbadf [Xusen Yin] fix bug of min max
4e4fbd1 [Xusen Yin] separate seqop and combop out as independent functions
a6d5a2e [Xusen Yin] rewrite for only computing non-zero elements
3980287 [Xusen Yin] rename variables
62a2c3e [Xusen Yin] use axpy and in-place if possible
9a75ebd [Xusen Yin] add case class to wrap return values
d816ac7 [Xusen Yin] remove useless APIs
c4651bb [Xusen Yin] remove row-wise APIs and refine code
1338ea1 [Xusen Yin] all-in-one version test passed
cc65810 [Xusen Yin] add parallel mean and variance
9af2e95 [Xusen Yin] refine the code style
ad6c82d [Xusen Yin] add shrink test
e09d5d2 [Xusen Yin] add scala docs and refine shrink method
8ef3377 [Xusen Yin] pass all tests
28cf060 [Xusen Yin] fix error of column means
54b19ab [Xusen Yin] add new API to shrink RDD[Vector]
8c6c0e1 [Xusen Yin] add basic statistics
2014-04-11 19:43:22 -07:00
Xiangrui Meng f5ace8da34 [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationMetrics
This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It also contains refactoring of https://github.com/apache/spark/pull/160 for binary classification evaluation.
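
A minimal sketch of how a sliding window of size two yields the area under a curve, assuming the trapezoid rule is applied to each pair of neighbouring points. The function names are hypothetical and the code works on a local list rather than an RDD.

```python
def sliding(points, window):
    """Yield consecutive windows of `window` items over a list."""
    for start in range(len(points) - window + 1):
        yield points[start:start + window]


def area_under_curve(curve):
    # Trapezoid rule over each pair of neighbouring (x, y) points.
    area = 0.0
    for (x0, y0), (x1, y1) in sliding(curve, 2):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A small ROC-like curve that starts at (0, 0) and ends at (1, 1).
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(roc))
```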

Author: Xiangrui Meng <meng@databricks.com>

Closes #364 from mengxr/auc and squashes the following commits:

a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
b1b7dab [Xiangrui Meng] fix code styles
9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
ca31da5 [Xiangrui Meng] remove PredictionAndResponse
3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
8f78958 [Xiangrui Meng] add PredictionAndResponse
dda82d5 [Xiangrui Meng] add confusion matrix
aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
221ebce [Xiangrui Meng] add a new test to sliding
a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
a9b250a [Xiangrui Meng] move sliding to mllib
cab9a52 [Xiangrui Meng] use last for the last element
db6cb30 [Xiangrui Meng] remove unnecessary toSeq
9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
c1c6c22 [Xiangrui Meng] add AreaUnderCurve
65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
5ee6001 [Xiangrui Meng] add TODO
d2a600d [Xiangrui Meng] add sliding to rdd
2014-04-11 12:06:13 -07:00
Sandeep 930b70f052 Remove Unnecessary Whitespace's
Stack these together in one commit; otherwise they show up chunk by chunk in different commits.

Author: Sandeep <sandeep@techaddict.me>

Closes #380 from techaddict/white_space and squashes the following commits:

b58f294 [Sandeep] Remove Unnecessary Whitespace's
2014-04-10 15:04:13 -07:00
Xiangrui Meng 0adc932add [SPARK-1357 (fix)] remove empty line after :: DeveloperApi/Experimental ::
Remove empty line after :: DeveloperApi/Experimental :: in comments to make the original doc show up in the preview of the generated html docs. Thanks @andrewor14 !

Author: Xiangrui Meng <meng@databricks.com>

Closes #373 from mengxr/api and squashes the following commits:

9c35bdc [Xiangrui Meng] remove the empty line after :: DeveloperApi/Experimental ::
2014-04-09 17:08:17 -07:00
Xiangrui Meng bde9cc11fe [SPARK-1357] [MLLIB] Annotate developer and experimental APIs
Annotate developer and experimental APIs in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes #298 from mengxr/api and squashes the following commits:

13390e8 [Xiangrui Meng] Merge branch 'master' into api
dc4cbb3 [Xiangrui Meng] mark distribute matrices experimental
6b9f8e2 [Xiangrui Meng] add Experimental annotation
8773d0d [Xiangrui Meng] add DeveloperApi annotation
da31733 [Xiangrui Meng] update developer and experimental tags
555e0fe [Xiangrui Meng] Merge branch 'master' into api
ef1a717 [Xiangrui Meng] mark some constructors private add default parameters to JavaDoc
00ffbcc [Xiangrui Meng] update tree API annotation
0b674fa [Xiangrui Meng] mark decision tree APIs
86b9e34 [Xiangrui Meng] one pass over APIs of GLMs, NaiveBayes, and ALS
f21d862 [Xiangrui Meng] Merge branch 'master' into api
2b133d6 [Xiangrui Meng] intial annotation of developer and experimental apis
2014-04-09 02:21:15 -07:00
Xiangrui Meng 9689b663a2 [SPARK-1390] Refactoring of matrices backed by RDDs
This is to refactor interfaces for matrices backed by RDDs. It would be better if we have a clear separation of local matrices and those backed by RDDs. Right now, we have

1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over RDD[Array[Double]], i.e. row-oriented format.

We will see a naming collision when we introduce a local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is not exact if we switch to `RDD[Vector]` from `RDD[Array[Double]]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.

The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):

1. `RDDMatrix`: trait for matrices backed by one or more RDDs
2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices
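
To make the relationship between these representations concrete, here is a local, non-distributed sketch: coordinate triples are grouped into indexed rows, and dropping the indices gives rows without a special ordering. The function names are hypothetical and plain lists stand in for RDDs and vectors.

```python
from collections import defaultdict


def coordinate_to_indexed_rows(entries, num_cols):
    """Group (row, col, value) triples into (row_index, dense_row) pairs."""
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, value in entries:
        rows[i][j] = value
    return sorted(rows.items())


def drop_indices(indexed_rows):
    # Forgetting the indices gives the "rows without special ordering" view.
    return [row for _, row in indexed_rows]


entries = [(0, 1, 2.0), (2, 0, 3.0), (0, 0, 1.0)]
indexed = coordinate_to_indexed_rows(entries, num_cols=2)
print(indexed)
print(drop_indices(indexed))
```

Note that rows with no entries (row 1 here) do not appear in the indexed form, which is one reason the indices matter.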

The current code also introduces local matrices.

Author: Xiangrui Meng <meng@databricks.com>

Closes #296 from mengxr/mat and squashes the following commits:

24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
bfc2b26 [Xiangrui Meng] merge master
8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
0135193 [Xiangrui Meng] address Reza's comments
03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix add toBreeze to DistributedMatrix for test simplify tests
b177ff1 [Xiangrui Meng] address Matei's comments
be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix add tests for matrices
b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
0d1491c [Xiangrui Meng] fix test errors
a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
b8b6ac3 [Xiangrui Meng] Remove old code
4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
2014-04-08 23:01:15 -07:00
Xiangrui Meng b9e0c937df [SPARK-1434] [MLLIB] change labelParser from anonymous function to trait
This is a patch to address @mateiz 's comment in https://github.com/apache/spark/pull/245

MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it, so I made a trait for LabelParser and provided two implementations: binary and multiclass.
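
The shape of that design can be sketched in Python as a small interface with two implementations. The exact parsing rules below (positive labels map to 1.0 for the binary case, labels kept as-is for the multiclass case) are assumptions made for illustration.

```python
class LabelParser:
    """Hypothetical interface: turn the label field of a LIBSVM line into a float."""

    def parse(self, label_string):
        raise NotImplementedError


class BinaryLabelParser(LabelParser):
    # Assumption: any positive label maps to 1.0, everything else to 0.0.
    def parse(self, label_string):
        return 1.0 if float(label_string) > 0 else 0.0


class MulticlassLabelParser(LabelParser):
    # Assumption: the label is kept as-is, only converted to float.
    def parse(self, label_string):
        return float(label_string)
```

A named type with a single method is easier to implement and pass around from Java than an anonymous Scala function.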

Author: Xiangrui Meng <meng@databricks.com>

Closes #345 from mengxr/label-parser and squashes the following commits:

ac44409 [Xiangrui Meng] use singleton objects for label parsers
3b1a7c6 [Xiangrui Meng] add tests for label parsers
c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
11c94e0 [Xiangrui Meng] add return types
7f8eb36 [Xiangrui Meng] change labelParser from anonymous function to trait
2014-04-08 20:37:01 -07:00
Holden Karau ce8ec54561 Spark 1271: Co-Group and Group-By should pass Iterable[X]
Author: Holden Karau <holden@pigscanfly.ca>

Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues\!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
2014-04-08 18:15:59 -07:00
Sean Owen 856c50f59b SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions
Another handful of small build changes to organize and standardize a bit, and avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

OK, I also updated the JIRA URL while I was at it here.

Author: Sean Owen <sowen@cloudera.com>

Closes #291 from srowen/SPARK-1387 and squashes the following commits:

461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
2014-04-06 17:41:01 -07:00
Xiangrui Meng 9c65fa76f9 [SPARK-1212, Part II] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
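
As a sketch of what loading LIBSVM-formatted data into labeled sparse points involves, the following plain-Python parser converts one-based LIBSVM indices to zero-based ones and infers the number of features from the data. The `LabeledPoint` tuple and function names here are hypothetical stand-ins, not the MLlib classes.

```python
from collections import namedtuple

# Hypothetical container mirroring the idea of LabeledPoint(label, features).
LabeledPoint = namedtuple("LabeledPoint", ["label", "indices", "values"])


def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into a label plus zero-based sparse features."""
    fields = line.strip().split()
    label = float(fields[0])
    indices, values = [], []
    for item in fields[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # LIBSVM indices start at 1
        values.append(float(value))
    return LabeledPoint(label, indices, values)


def num_features(points):
    # The feature count can be inferred as the largest index seen, plus one.
    return max((max(p.indices) for p in points if p.indices), default=-1) + 1


point = parse_libsvm_line("1 1:0.5 4:2.0")
print(point, num_features([point]))
```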

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
2014-04-02 14:01:12 -07:00
Manish Amde 8b3045ceab MLI-1 Decision Trees
Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.

Key features:
+ Supports binary classification and regression
+ Supports gini, entropy and variance for information gain calculation
+ Supports both continuous and categorical features

The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:

1. Level-wise training to reduce passes over the entire dataset.
2. Bin-wise split calculation to reduce computation overhead.
3. Aggregation over partitions before combining to reduce communication overhead.
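
For reference, the three impurity measures and the resulting information gain can be written out on small local lists. This is a textbook sketch with hypothetical function names; it does not reproduce the level-wise or bin-wise optimizations listed above.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0


def variance(values):
    # Impurity for regression: population variance of the targets.
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def information_gain(parent, left, right, impurity):
    # Gain = parent impurity minus the size-weighted impurity of the two children.
    n = len(parent)
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
    return impurity(parent) - weighted
```

For example, splitting `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` removes all impurity, so the gain equals the parent's impurity under either classification measure.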

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #79 from manishamde/tree and squashes the following commits:

1e8c704 [Manish Amde] remove numBins field in the Strategy class
7d54b4f [manishamde] Merge pull request #4 from mengxr/dtree
f536ae9 [Xiangrui Meng] another pass on code style
e1dd86f [Manish Amde] implementing code style suggestions
62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
201702f [Manish Amde] making some more methods private
f963ef5 [Manish Amde] making methods private
c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
24500c5 [Xiangrui Meng] minor style updates
4576b64 [Manish Amde] documentation and for to while loop conversion
ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
632818f [Manish Amde] removing threshold for classification predict method
2116360 [Manish Amde] removing dummy bin calculation for categorical variables
6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
62c2562 [Manish Amde] fixing comment indentation
ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
d1ef4f6 [Manish Amde] more documentation
794ff4d [Manish Amde] minor improvements to docs and style
eb8fcbe [Manish Amde] minor code style updates
cd2c2b4 [Manish Amde] fixing code style based on feedback
63e786b [Manish Amde] added multiple train methods for java compatability
d3023b3 [Manish Amde] adding more docs for nested methods
84f85d6 [Manish Amde] code documentation
9372779 [Manish Amde] code style: max line lenght <= 100
dd0c0d7 [Manish Amde] minor: some docs
0dd7659 [manishamde] basic doc
5841c28 [Manish Amde] unit tests for categorical features
f067d68 [Manish Amde] minor cleanup
c0e522b [Manish Amde] updated predict and split threshold logic
b09dc98 [Manish Amde] minor refactoring
6b7de78 [Manish Amde] minor refactoring and tests
d504eb1 [Manish Amde] more tests for categorical features
dbb7ac1 [Manish Amde] categorical feature support
6df35b9 [Manish Amde] regression predict logic
53108ed [Manish Amde] fixing index for highest bin
e23c2e5 [Manish Amde] added regression support
c8f6d60 [Manish Amde] adding enum for feature type
b0e3e76 [Manish Amde] adding enum for feature type
154aa77 [Manish Amde] enums for configurations
733d6dd [Manish Amde] fixed tests
02c595c [Manish Amde] added command line parsing
98ec8d5 [Manish Amde] tree building and prediction logic
b0eb866 [Manish Amde] added logic to handle leaf nodes
80e8c66 [Manish Amde] working version of multi-level split calculation
4798aae [Manish Amde] added gain stats class
dad0afc [Manish Amde] decison stump functionality working
03f534c [Manish Amde] some more tests
0012a77 [Manish Amde] basic stump working
8bca1e2 [Manish Amde] additional code for creating intermediate RDD
92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
cd53eae [Manish Amde] skeletal framework
2014-04-01 21:40:49 -07:00
Xiangrui Meng d679843a39 [SPARK-1327] GLM needs to check addIntercept for intercept and weights
GLM needs to check addIntercept for intercept and weights. The current implementation always uses the first weight as intercept. Added a test for training without adding intercept.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1327

Author: Xiangrui Meng <meng@databricks.com>

Closes #236 from mengxr/glm and squashes the following commits:

bcac1ac [Xiangrui Meng] add two tests to ensure {Lasso, Ridge}.setIntercept will throw an exceptions
a104072 [Xiangrui Meng] remove protected to be compatible with 0.9
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
2014-03-26 19:30:20 -07:00
Xiangrui Meng 80c29689ae [SPARK-1212] Adding sparse data support and update KMeans
Continue our discussions from https://github.com/apache/incubator-spark/pull/575

This PR is WIP because it depends on a SNAPSHOT version of breeze.

Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.

@fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.

I'm going to update this PR to include:

1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.

2. Some numbers about the performance.

3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!
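
The fast distance computation in item 1 can be sketched as follows, with a dense center stored as a list and a sparse point stored as a dict. The function names are hypothetical, and the check for when the expansion introduces too much numerical error is left out.

```python
def squared_norm(sparse):
    # `sparse` is a dict {index: value}; missing indices are zero.
    return sum(v * v for v in sparse.values())


def fast_squared_distance(center, center_norm2, point, point_norm2):
    """||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, with `center` dense and `point` sparse."""
    dot = sum(v * center[i] for i, v in point.items())  # O(nnz) instead of O(n)
    return center_norm2 + point_norm2 - 2.0 * dot


def exact_squared_distance(center, point):
    # Reference O(n) computation, usable when cancellation error would be too large.
    return sum((c - point.get(i, 0.0)) ** 2 for i, c in enumerate(center))


center = [1.0, 2.0, 0.0, 4.0]
point = {1: 2.0, 3: 1.0}
center_norm2 = sum(c * c for c in center)  # pre-computed once per center
print(fast_squared_distance(center, center_norm2, point, squared_norm(point)))
print(exact_squared_distance(center, point))
```

With the squared norms pre-computed, each distance costs only a sparse dot product.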

Author: Xiangrui Meng <meng@databricks.com>

Closes #117 from mengxr/sparse-kmeans and squashes the following commits:

67b368d [Xiangrui Meng] fix SparseVector.toArray
5eda0de [Xiangrui Meng] update NOTICE
67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
b28ba2f [Xiangrui Meng] add toArray to Vector
e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
72bde33 [Xiangrui Meng] clean up code for distance computation
712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
27858e4 [Xiangrui Meng] update breeze version to 0.7
07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
42b4e50 [Xiangrui Meng] line feed at the end
a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
0107e19 [Xiangrui Meng] update NOTICE
87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
4e7d5ca [Xiangrui Meng] minor style update
07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
2014-03-23 17:34:02 -07:00
Reza Zadeh 66a03e5fe0 Principal Component Analysis
# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the returned coefficient matrix contains the coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.

## Testing
Tests included:
 * All principal components
 * Only top k principal components
 * Dense SVD tests
 * Dense/sparse matrix tests

The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html

## Documentation
Added to mllib-guide.md

## Example Usage
Added to examples directory under SparkPCA.scala

Author: Reza Zadeh <rizlar@gmail.com>

Closes #88 from rezazadeh/sparkpca and squashes the following commits:

e298700 [Reza Zadeh] reformat using IDE
3f23271 [Reza Zadeh] documentation and cleanup
b025ab2 [Reza Zadeh] documentation
e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
3787bb4 [Reza Zadeh] stylin
c6ecc1f [Reza Zadeh] docs
aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
56975b0 [Reza Zadeh] docs
2df9bde [Reza Zadeh] docs update
8fb0015 [Reza Zadeh] rcond documentation
dbf7797 [Reza Zadeh] correct argument number
a9f1f62 [Reza Zadeh] documentation
4ce6caa [Reza Zadeh] style changes
9a56a02 [Reza Zadeh] use rcond relative to larget svalue
120f796 [Reza Zadeh] housekeeping
156ff78 [Reza Zadeh] string comprehension
2e1cf43 [Reza Zadeh] rename rcond
ea223a6 [Reza Zadeh] many style changes
f4002d7 [Reza Zadeh] more docs
bd53c7a [Reza Zadeh] proper accumulator
a8b5ecf [Reza Zadeh] Don't use for loops
0dc7980 [Reza Zadeh] filter zeros in sparse
6115610 [Reza Zadeh] More documentation
36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
bc4599f [Reza Zadeh] configurable rcond
86f7515 [Reza Zadeh] compute per parition, use while
09726b3 [Reza Zadeh] more style changes
4195e69 [Reza Zadeh] private, accumulator
17002be [Reza Zadeh] style changes
4ba7471 [Reza Zadeh] style change
f4982e6 [Reza Zadeh] Use dense matrix in example
2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
f807be9 [Reza Zadeh] fix typo
2d7ccde [Reza Zadeh] Array interface for dense svd and pca
cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
398d123 [Reza Zadeh] style change
55abbfa [Reza Zadeh] docs fix
ef29644 [Reza Zadeh] bad chnage undo
472566e [Reza Zadeh] all files from old pr
555168f [Reza Zadeh] initial files
2014-03-20 10:39:20 -07:00
Xiangrui Meng f9d8a83c00 [SPARK-1266] persist factors in implicit ALS
In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decreased by ~70% for implicit ALS on the movielens data.

I also made the following changes:

1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation.

2. Mark methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods.

3. Materialize the final matrix factors before returning the model. It allows us to clean up other cached RDDs before returning the model. I do not have a better solution here, so I use `RDD.count()`.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1266

Author: Xiangrui Meng <meng@databricks.com>

Closes #165 from mengxr/als and squashes the following commits:

c9676a6 [Xiangrui Meng] add a comment about the last products.persist
d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ...
63862d6 [Xiangrui Meng] persist factors in implicit ALS
2014-03-18 17:20:42 -07:00
Xiangrui Meng e108b9ab94 [SPARK-1260]: faster construction of features with intercept
The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here.
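
A minimal sketch of the prepend described above, using only the Scala standard library. The object and method names are hypothetical and not taken from the patch.

```scala
object InterceptSketch {
  // Hypothetical helper: returns a new array with the intercept term 1.0
  // in front of the features. The input array is left unchanged.
  def withIntercept(features: Array[Double]): Array[Double] =
    1.0 +: features
}
```

For example, `InterceptSketch.withIntercept(Array(2.0, 3.0))` yields an array holding 1.0, 2.0, 3.0.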

Also, I don't see a reason to set initial weights to ones. So I set them to zeros.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1260

Author: Xiangrui Meng <meng@databricks.com>

Closes #161 from mengxr/sgd and squashes the following commits:

b5cfc53 [Xiangrui Meng] set default weights to zeros
a1439c2 [Xiangrui Meng] faster construction of features with intercept
2014-03-18 15:14:13 -07:00
Xiangrui Meng e4e8d8f395 [SPARK-1237, 1238] Improve the computation of YtY for implicit ALS
Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data.
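
The idea can be sketched in plain Scala. This is an illustration, not the patch's code: it assumes the upper-triangular, column-major packed layout that BLAS DSPR uses, and all names (`PackedYtYSketch`, `spr`, `computeYtY`) are hypothetical. A single packed buffer of k(k+1)/2 doubles is updated in place, so no k-by-k matrix is created per row.

```scala
object PackedYtYSketch {
  // Number of entries in the packed upper triangle of a k-by-k symmetric matrix.
  def packedSize(k: Int): Int = k * (k + 1) / 2

  // In-place symmetric rank-1 update A := A + alpha * x * x^T, where A is
  // stored as its upper triangle in column-major packed order:
  // entry (i, j) with i <= j lives at index i + j * (j + 1) / 2.
  def spr(alpha: Double, x: Array[Double], packed: Array[Double]): Unit = {
    val k = x.length
    require(packed.length == packedSize(k), "packed array has the wrong size")
    var idx = 0
    var j = 0
    while (j < k) {
      val axj = alpha * x(j)
      var i = 0
      while (i <= j) {
        packed(idx) += axj * x(i)
        idx += 1
        i += 1
      }
      j += 1
    }
  }

  // Accumulate Y^T Y over the rows y_i of Y into one packed buffer.
  def computeYtY(rows: Iterator[Array[Double]], k: Int): Array[Double] = {
    val acc = new Array[Double](packedSize(k))
    rows.foreach(y => spr(1.0, y, acc))
    acc
  }
}
```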

To compare the results, I also added an option to set a random seed in ALS.

JIRA:
1. https://spark-project.atlassian.net/browse/SPARK-1237
2. https://spark-project.atlassian.net/browse/SPARK-1238

Author: Xiangrui Meng <meng@databricks.com>

Closes #131 from mengxr/als and squashes the following commits:

ed00432 [Xiangrui Meng] minor changes
d984623 [Xiangrui Meng] minor changes
2fc1641 [Xiangrui Meng] remove commented code
4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS
200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
2014-03-13 00:43:19 -07:00
CodingCat 9032f7c0d5 SPARK-1160: Deprecate toArray in RDD
https://spark-project.atlassian.net/browse/SPARK-1160

reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."

In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly
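
A small sketch of what such a deprecation can look like in Scala. `ToyCollection` is a hypothetical stand-in for the RDD class, and the message and version strings are illustrative assumptions.

```scala
// Hypothetical stand-in for an RDD-like class.
class ToyCollection[T](items: Seq[T]) {
  // The method callers should use.
  def collect(): Seq[T] = items

  // Kept for source compatibility. The annotation carries a message and a
  // "since" version, so the compiler warns at every call site.
  @deprecated("use collect", "1.0.0")
  def toArray(): Seq[T] = collect()
}
```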

Author: CodingCat <zhunansjtu@gmail.com>

Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:

286f163 [CodingCat] deprecate in JavaRDDLike
ee17b4e [CodingCat] add message and since
2ff7319 [CodingCat] deprecate toArray in RDD
2014-03-12 17:43:12 -07:00
Sandy Ryza a99fb3747a SPARK-1193. Fix indentation in pom.xmls
Author: Sandy Ryza <sandy@cloudera.com>

Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:

a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
2014-03-07 23:10:35 -08:00
Patrick Wendell c3f5e07533 SPARK-1121: Include avro for yarn-alpha builds
This lets us explicitly include Avro based on a profile for 0.23.X
builds. It makes me sad how convoluted it is to express this logic
in Maven. @tgraves and @sryza curious if this works for you.

I'm also considering just reverting to how it was before. The only
real problem was that Spark advertised a dependency on Avro
even though it only really depends transitively on Avro through
other deps.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #49 from pwendell/avro-build-fix and squashes the following commits:

8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
2014-03-02 15:18:19 -08:00
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00
DB Tsai 6fc76e49c1 Initialized the regVal for first iteration in SGD optimizer
Ported from https://github.com/apache/incubator-spark/pull/633

In runMiniBatchSGD, the regVal (for 1st iter) should be initialized
as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.

It may not be important here for SGD, since the updater doesn't take the loss
as a parameter to find the new weights, but it will give us the correct history of loss.
However, for the LBFGS optimizer we implemented, the correct loss with regVal is crucial to
find the new weights.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits:

77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.
2014-03-02 00:31:59 -08:00
Sean Owen c8a4c9b1f6 MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features
There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`:

```
        factors.flatMapValues{ case factorArray =>
          factorArray.map{ vector =>
            val x = new DoubleMatrix(vector)
            x.mmul(x.transpose())
          }
        }.reduceByKeyLocally((a, b) => a.addi(b))
         .values
         .reduce((a, b) => a.addi(b))
```

Completely correct, but there's a subtle but quite large memory problem here. map() is going to create all of these matrices in memory at once, when they don't need to ever all exist at the same time.
For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap.
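
The 32GB figure can be checked directly, assuming 8 bytes per double and ignoring object overhead:

$$
n \cdot f^{2} \cdot 8\ \text{bytes} = 10^{5} \times 200^{2} \times 8\ \text{bytes} = 3.2 \times 10^{10}\ \text{bytes} \approx 32\ \text{GB}
$$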

Fortunately there's a trivial change that fixes it; just add `.view` in there.
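
A self-contained sketch of the effect, using plain arrays instead of `DoubleMatrix`. The names are hypothetical and this is not the code in `ALS.scala`. With `.view`, `map` is lazy, so each outer product is built only when `reduce` asks for it and becomes garbage right after it has been added to the running sum.

```scala
object LazyOuterProductSketch {
  // Outer product x * x^T as a flat row-major array of length f * f.
  def outer(x: Array[Double]): Array[Double] = {
    val f = x.length
    val out = new Array[Double](f * f)
    var i = 0
    while (i < f) {
      var j = 0
      while (j < f) {
        out(i * f + j) = x(i) * x(j)
        j += 1
      }
      i += 1
    }
    out
  }

  // In-place a += b; returns a.
  def addi(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) {
      a(i) += b(i)
      i += 1
    }
    a
  }

  // Sum of outer products over a non-empty array of vectors.
  // Without `.view`, `map` would allocate every product before reducing.
  def sumOuter(vectors: Array[Array[Double]]): Array[Double] =
    vectors.view.map(outer).reduce((a, b) => addi(a, b))
}
```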

Author: Sean Owen <sowen@cloudera.com>

Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits:

062cda9 [Sean Owen] Update style per review comments
e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices
2014-02-21 12:46:12 -08:00
Sean Owen 9e63f80e75 MLLIB-22. Support negative implicit input in ALS
I'm back with another less trivial suggestion for ALS:

In ALS for implicit feedback, input values are treated as weights on squared-errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared-errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.

There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view or like or buy, but equally, can come from "not interested" signals. The natural representation is negative values.

The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.

The implications for the algorithm are simple:
* the confidence function value must not be negative, and so can become 1 + alpha*|r|
* the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative

This in turn entails just a few lines of code change in `ALS.scala`:
* `rs(i)` becomes `abs(rs(i))`
* When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true for any us(i) that is iterated over, before, since these are exactly the ones for which P is 1. But now P is zero where rs(i) <= 0, and should not be added
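
The two rules above can be sketched as a pair of small functions. The names are hypothetical and the actual change lives inline in `ALS.scala`.

```scala
object ImplicitFeedbackSketch {
  // Confidence weight c = 1 + alpha * |r|: never negative, and it grows
  // with the magnitude of the input regardless of its sign.
  def confidence(r: Double, alpha: Double): Double =
    1.0 + alpha * math.abs(r)

  // Preference P: 1 only where the input is strictly positive, otherwise 0.
  def preference(r: Double): Double =
    if (r > 0.0) 1.0 else 0.0
}
```

With alpha = 0.5, inputs r = 2 and r = -2 both get confidence 2.0, but only r = 2 gets preference 1.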

I think it's a safe change because:
* It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
* It's the simplest direct extension of the paper's algorithm
* (I've used it to good effect in production FWIW)

Tests included.

I tweaked minor things en route:
* `ALS.scala` javadoc writes "R = Xt*Y" when the paper and rest of code defines it as "R = X*Yt"
* RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually sum of weights
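
For reference, a weighted RMSE whose denominator really is the sum of the weights might look like the sketch below. The names are hypothetical and this is not the test code itself.

```scala
object WeightedRmseSketch {
  // sqrt( sum_i w_i * (p_i - a_i)^2 / sum_i w_i )
  def weightedRmse(predicted: Seq[Double], actual: Seq[Double], weights: Seq[Double]): Double = {
    require(predicted.length == actual.length && actual.length == weights.length,
      "all inputs must have the same length")
    require(weights.nonEmpty, "at least one observation is required")
    var weightedSquaredError = 0.0
    var weightSum = 0.0
    predicted.zip(actual).zip(weights).foreach { case ((p, a), w) =>
      val err = p - a
      weightedSquaredError += w * err * err
      weightSum += w
    }
    math.sqrt(weightedSquaredError / weightSum)
  }
}
```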

Excuse my Scala style; I'm sure it needs tweaks.

Author: Sean Owen <sowen@cloudera.com>

Closes #500 from srowen/ALSNegativeImplicitInput and squashes the following commits:

cf902a9 [Sean Owen] Support negative implicit input in ALS
953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
2014-02-19 23:44:53 -08:00
Chen Chao f9b7d64a4e MLLIB-24: url of "Collaborative Filtering for Implicit Feedback Datasets" in ALS is invalid now
The URL of "Collaborative Filtering for Implicit Feedback Datasets" is now invalid. A new URL is provided: http://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf

Author: Chen Chao <crazyjvm@gmail.com>

Closes #619 from CrazyJvm/master and squashes the following commits:

a0b54e4 [Chen Chao] change url to IEEE
9e0e9f0 [Chen Chao] correct spell mistale
fcfab5d [Chen Chao] wrap line to to fit within 100 chars
590d56e [Chen Chao] url error
2014-02-19 22:06:35 -08:00
Martin Jaggi 2182aa3c55 Merge pull request #566 from martinjaggi/copy-MLlib-d.
new MLlib documentation for optimization, regression and classification

new documentation with tex formulas, hopefully improving usability and reproducibility of the offered MLlib methods.
also did some minor changes in the code for consistency. scala tests pass.

this is the rebased branch, i deleted the old PR

jira:
https://spark-project.atlassian.net/browse/MLLIB-19

Author: Martin Jaggi <m.jaggi@gmail.com>

Closes #566 and squashes the following commits:

5f0f31e [Martin Jaggi] line wrap at 100 chars
4e094fb [Martin Jaggi] better description of GradientDescent
1d6965d [Martin Jaggi] remove broken url
ea569c3 [Martin Jaggi] telling what updater actually does
964732b [Martin Jaggi] lambda R() in documentation
a6c6228 [Martin Jaggi] better comments in SGD code for regression
b32224a [Martin Jaggi] new optimization documentation
d5dfef7 [Martin Jaggi] new classification and regression documentation
b07ead6 [Martin Jaggi] correct scaling for MSE loss
ba6158c [Martin Jaggi] use d for the number of features
bab2ed2 [Martin Jaggi] renaming LeastSquaresGradient
2014-02-09 15:19:50 -08:00
Patrick Wendell b69f8b2a01 Merge pull request #557 from ScrapCodes/style. Closes #557.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

== Merge branch commits ==

commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
Author: Prashant Sharma <scrapcodes@gmail.com>
Date:   Sun Feb 9 17:39:07 2014 +0530

    scala style fixes

commit f91709887a8e0b608c5c2b282db19b8a44d53a43
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Jan 24 11:22:53 2014 -0800

    Adding scalastyle snapshot
2014-02-09 10:09:19 -08:00
Mark Hamstra c2341c92bb Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
2014-02-08 16:00:43 -08:00
Xiangrui Meng 23af00f9e0 Merge pull request #528 from mengxr/sample. Closes #528.
Refactor RDD sampling and add randomSplit to RDD (update)

Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are:

1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513
2) Stratified sampling and importance sampling can be implemented in the same manner as well.
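
One way to see how a split can be built from Bernoulli-style sampling is the local sketch below. It works on an in-memory `Seq` rather than an RDD, and the names `bernoulliCell` and `randomSplit` are illustrative assumptions, not the Spark API. Each split keeps the items whose random draw falls in its own sub-range of [0, 1). Replaying the same seed for every split gives each item the same draw each time, so it lands in exactly one split.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Keep an item when its random draw falls in [lb, ub).
  def bernoulliCell[T](items: Seq[T], lb: Double, ub: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val x = rng.nextDouble()
      x >= lb && x < ub
    }
  }

  // Split items into weights.length parts with sizes roughly proportional
  // to the (positive) weights.
  def randomSplit[T](items: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
    require(weights.nonEmpty && weights.forall(_ > 0.0), "weights must be positive")
    val total = weights.sum
    val cumulative = weights.map(_ / total).scanLeft(0.0)(_ + _)
    // Pin the last bound to 1.0 so rounding cannot leave a gap at the top.
    val bounds = cumulative.init :+ 1.0
    bounds.zip(bounds.tail).map { case (lb, ub) => bernoulliCell(items, lb, ub, seed) }
  }
}
```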

Unit tests are included for samplers and RDD.randomSplit.

This should perform better than my previous request where the BernoulliSampler creates many Iterator instances:
https://github.com/apache/incubator-spark/pull/513

Author: Xiangrui Meng <meng@databricks.com>

== Merge branch commits ==

commit e8ce957e5f0a600f2dec057924f4a2ca6adba373
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 12:21:08 2014 -0800

    more docs to PartitionwiseSampledRDD

commit fbb4586d0478ff638b24bce95f75ff06f713d43b
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 00:44:23 2014 -0800

    move XORShiftRandom to util.random and use it in BernoulliSampler

commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 11:06:59 2014 -0800

    relax assertions in SortingSuite because the RangePartitioner has large variance in this case

commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:56:28 2014 -0800

    test split ratio of RDD.randomSplit

commit 8a410bc933a60c4d63852606f8bbc812e416d6ae
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:25:22 2014 -0800

    add a test to ensure seed distribution and minor style update

commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:06:22 2014 -0800

    minor style change

commit 750912b4d77596ed807d361347bd2b7e3b9b7a74
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:04:54 2014 -0800

    fix some long lines

commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:59:59 2014 -0800

    add complement to BernoulliSampler and minor style changes

commit dbe2bc2bd888a7bdccb127ee6595840274499403
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:45:08 2014 -0800

    switch to partition-wise sampling for better performance

commit a1fca5232308feb369339eac67864c787455bb23
Merge: ac712e4 cf6128f
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 16:33:09 2014 -0800

    Merge branch 'sample' of github.com:mengxr/incubator-spark into sample

commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:40:07 2014 -0800

    set SampledRDD deprecated in 1.0

commit f430f847c3df91a3894687c513f23f823f77c255
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:38:59 2014 -0800

    update code style

commit a8b5e2021a9204e318c80a44d00c5c495f1befb6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:56:27 2014 -0800

    move package random to util.random

commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:50:35 2014 -0800

    add Apache headers and update code style

commit 985609fe1a55655ad11966e05a93c18c138a403d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:49:25 2014 -0800

    add new lines

commit b21bddf29850a2c006a868869b8f91960a029322
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:46:35 2014 -0800

    move samplers to random.IndependentRandomSampler and add tests

commit c02dacb4a941618e434cefc129c002915db08be6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Jan 25 15:20:24 2014 -0800

    add RandomSampler

commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 24 13:23:22 2014 -0800

    init impl of IndependentlySampledRDD
2014-02-03 13:02:09 -08:00
Sean Owen f67ce3e229 Merge pull request #460 from srowen/RandomInitialALSVectors
Choose initial user/item vectors uniformly on the unit sphere

...rather than within the unit square to possibly avoid bias in the initial state and improve convergence.

The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets just a little large, the vectors tend strongly to point into the "corner", towards (1,1,1...,1). The vectors are not unit vectors either.

I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This gets you uniform random choices on the unit sphere which is more what's of interest here. It has worked a little better for me in the past.
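
A minimal sketch of that suggestion, with hypothetical names. Note that a later commit in this PR switches to all-positive components, which this sketch does not do.

```scala
import scala.util.Random

object UnitSphereSketch {
  // Draw each component from N(0, 1), then scale the vector to unit length.
  // The direction of the result is uniform on the unit sphere.
  def randomUnitVector(n: Int, rng: Random): Array[Double] = {
    val v = Array.fill(n)(rng.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }
}
```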

This is pretty minor but wanted to warm up suggesting a few tweaks to ALS.
Please excuse my Scala, pretty new to it.

Author: Sean Owen <sowen@cloudera.com>

== Merge branch commits ==

commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
Author: Sean Owen <sowen@cloudera.com>
Date:   Mon Jan 27 08:05:25 2014 +0000

    Style: spaces around binary operators

commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
Author: Sean Owen <sowen@cloudera.com>
Date:   Sun Jan 19 22:50:03 2014 +0000

    Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460

commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
Author: Sean Owen <sowen@cloudera.com>
Date:   Sat Jan 18 15:54:42 2014 +0000

    Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
2014-01-27 11:15:51 -08:00
Matei Zaharia d009b17d13 Merge pull request #315 from rezazadeh/sparsesvd
Sparse SVD

# Singular Value Decomposition
Given an *m x n* matrix *A*, compute matrices *U, S, V* such that

*A = U * S * V^T*

There is no restriction on m, but we require n^2 doubles to fit in memory.
Further, n should be less than m.

The decomposition is computed by first computing *A^TA = V S^2 V^T*,
computing svd locally on that (since n x n is small),
from which we recover S and V.
Then we compute U via easy matrix multiplication
as *U =  A * V * S^-1*

Only the singular vectors associated with the largest k singular values are returned.
If there are k such values, then the dimensions of the return will be:

* *S* is *k x k* and diagonal, holding the singular values on diagonal.
* *U* is *m x k* and satisfies U^T*U = eye(k).
* *V* is *n x k* and satisfies V^TV = eye(k).
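
As a quick consistency check of the recipe above (a sketch that assumes the k retained singular values are nonzero, so that $S$ is invertible), substituting $A^{T}A = V S^{2} V^{T}$ into $U = A V S^{-1}$ gives orthonormal columns:

$$
U^{T}U = S^{-1} V^{T} (A^{T}A)\, V S^{-1} = S^{-1} V^{T} V S^{2} V^{T} V S^{-1} = S^{-1} S^{2} S^{-1} = I_k
$$

When all n singular vectors are kept and $V$ is square, $U S V^{T} = A V S^{-1} S V^{T} = A V V^{T} = A$. For k < n the product is a rank-k approximation of $A$.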

All input and output is expected in sparse matrix format, 0-indexed
as tuples of the form ((i,j),value) all in RDDs.

# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included

# Example Usage

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry

// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
      val parts = line.split(',')
      MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4

// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)

println("singular values = " + decomposed.S.data.toArray.mkString)

# Documentation
Added to docs/mllib-guide.md
2014-01-22 14:01:30 -08:00
Andrew Tulloch 3a067b4a76 Fixed import order 2014-01-21 13:36:53 +00:00
Andrew Tulloch 720836a761 LocalSparkContext for MLlib 2014-01-19 17:51:00 +00:00
Sean Owen e91ad3f164 Correct L2 regularized weight update with canonical form 2014-01-18 12:53:01 +00:00
Reza Zadeh 85b95d039d rename to MatrixSVD 2014-01-17 14:40:51 -08:00
Reza Zadeh fa3299835b rename to MatrixSVD 2014-01-17 14:39:30 -08:00
Reza Zadeh caf97a25a2 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-17 14:34:03 -08:00
Reza Zadeh c9b4845bc1 prettify 2014-01-17 14:14:29 -08:00
Reza Zadeh dbec69bbf4 add rename computeSVD 2014-01-17 13:59:05 -08:00
Reza Zadeh eb2d8c431f replace this.type with SVD 2014-01-17 13:57:27 -08:00
Reza Zadeh cb13b15a60 use 0-indexing 2014-01-17 13:55:42 -08:00
Reynold Xin 84595ea3e2 Merge pull request #414 from soulmachine/code-style
Code clean up for mllib

* Removed unnecessary parentheses
* Removed unused imports
* Simplified `filter...size()` to `count ...`
* Removed obsoleted parameters' comments
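
The `filter`/`size` simplification mentioned above can be illustrated on a plain Scala collection. The names are hypothetical. Both forms return the same number, but `count` does not build the intermediate filtered collection.

```scala
object CountSketch {
  // Before: builds an intermediate collection only to take its size.
  def positivesViaFilter(xs: Seq[Int]): Int = xs.filter(_ > 0).size

  // After: counts matching elements in one pass.
  def positivesViaCount(xs: Seq[Int]): Int = xs.count(_ > 0)
}
```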
2014-01-15 20:15:29 -08:00
Frank Dai 57fcfc75b3 Added parentheses for that getDouble() also has side effect 2014-01-14 18:56:11 +08:00
Patrick Wendell 23034798d7 Add missing header files 2014-01-14 01:17:13 -08:00
Reza Zadeh 845e568fad Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-13 23:52:34 -08:00
Frank Dai a3da468d8b Merge remote-tracking branch 'upstream/master' into code-style 2014-01-14 15:29:17 +08:00
Patrick Wendell fdaabdc673 Merge pull request #380 from mateiz/py-bayes
Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-13 23:08:26 -08:00
Frank Dai c2852cf42e Indent two spaces 2014-01-14 14:59:01 +08:00
Frank Dai 12386b3eea Since getLong() and getInt() have side effect, get back parentheses, and remove an empty line 2014-01-14 14:53:10 +08:00
Frank Dai 0d94d74edf Code clean up for mllib 2014-01-14 14:37:26 +08:00
Henry Saputra 91a563608e Merge branch 'master' into remove_simpleredundantreturn_scala 2014-01-12 10:34:13 -08:00
Henry Saputra 93a65e5fde Remove simple redundant return statement for Scala methods/functions:
-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
2014-01-12 10:30:04 -08:00
Matei Zaharia f00e949f84 Added Java unit test, data, and main method for Naive Bayes
Also fixes mains of a few other algorithms to print the final model
2014-01-11 22:30:48 -08:00
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
jerryshao cbfbc01938 Fix configure didn't work small problem in ALS 2014-01-11 16:22:45 +08:00
Reza Zadeh 21c8a54c08 Merge remote-tracking branch 'upstream/master' into sparsesvd
Conflicts:
	docs/mllib-guide.md
2014-01-09 22:45:32 -08:00
Reza Zadeh 7d7490b67b More sparse matrix usage. 2014-01-07 17:16:17 -08:00
Hossein Falaki 3a8beb46cb Merge branch 'master' into MatrixFactorizationModel-fix 2014-01-07 15:22:42 -08:00
Hossein Falaki 04132ea9b2 Added Rating deserializer 2014-01-06 12:19:08 -08:00
Hossein Falaki 11a93fb5a8 Added serializing method for Rating object 2014-01-06 12:18:03 -08:00
Xusen Yin 05e6d5b454 Added GradientDescentSuite 2014-01-06 16:54:00 +08:00
Xusen Yin a72107284a fix logistic loss bug 2014-01-06 12:30:17 +08:00
Reynold Xin d43ad3ef2c Merge pull request #292 from soulmachine/naive-bayes
standard Naive Bayes classifier

Implements the standard Naive Bayes classifier. This is an updated version of #288, which was closed by mistake.
2014-01-04 16:29:30 -08:00
Hossein Falaki 8d0c2f7399 Added python binding for bulk recommendation 2014-01-04 16:23:17 -08:00
Reza Zadeh 06c0f7628a use SparseMatrix everywhere 2014-01-04 14:28:07 -08:00
Reza Zadeh cdff9fc858 prettify 2014-01-04 12:44:04 -08:00
Reza Zadeh e9bd6cb51d new example file 2014-01-04 12:33:22 -08:00
Reza Zadeh 8bfcce1ad8 fix tests 2014-01-04 11:52:42 -08:00
Reza Zadeh 35adc72794 set methods 2014-01-04 11:30:36 -08:00
Reza Zadeh 73daa700bd add k parameter 2014-01-04 01:52:28 -08:00
Reza Zadeh 26a74f0c41 using decomposed matrix struct now 2014-01-04 00:38:53 -08:00
Reza Zadeh d2d5e5e062 new return struct 2014-01-04 00:15:04 -08:00
Reza Zadeh 7f631dd2a9 start using matrixentry 2014-01-03 22:17:24 -08:00
Reza Zadeh 6bcdb762a1 rename sparsesvd.scala 2014-01-03 21:55:38 -08:00
Reza Zadeh b059a2a00c New matrix entry file 2014-01-03 21:54:57 -08:00
Hossein Falaki dfe57fa84c Removed unnecessary blank line 2014-01-03 15:40:53 -08:00
Hossein Falaki 2c1cba851c Added unit tests for bulk prediction in MatrixFactorizationModel 2014-01-03 15:35:20 -08:00
Hossein Falaki 67f937ec22 Added a method to enable bulk prediction 2014-01-03 15:34:16 -08:00
Reza Zadeh e617ae2dad fix error message 2014-01-02 01:51:38 -08:00
Reza Zadeh 61405785bc Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-02 01:50:30 -08:00
Reza Zadeh 2612164f85 more docs yay 2014-01-01 20:22:29 -08:00
Reza Zadeh 915d53f8ac javadoc for sparsesvd 2014-01-01 20:20:16 -08:00
Reza Zadeh 185c882606 tweaks to docs 2014-01-01 19:53:14 -08:00
Lian, Cheng dd6033e685 Aggregated all sample points to driver without any shuffle 2014-01-02 01:38:24 +08:00
Lian, Cheng 6d0e2e86df Response to comments from Reynold, Ameet and Evan
* Arguments renamed according to Ameet's suggestion
* Using DoubleMatrix instead of Array[Double] in computation
* Removed arguments C (kinds of label) and D (dimension of feature vector) from NaiveBayes.train()
* Replaced reduceByKey with foldByKey to avoid modifying original input data
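
The last point can be illustrated on local collections. This is a sketch with hypothetical names, using `reduceLeft` and `foldLeft` in place of the RDD operations. With an in-place combiner, a reduce uses the first input element as its accumulator and overwrites it, while a fold starts from a fresh zero array and leaves the inputs alone.

```scala
object FoldVsReduceSketch {
  // In-place acc += x; returns acc.
  def addInPlace(acc: Array[Double], x: Array[Double]): Array[Double] = {
    var i = 0
    while (i < acc.length) {
      acc(i) += x(i)
      i += 1
    }
    acc
  }

  // The first row becomes the accumulator, so the input data is modified.
  def sumWithReduce(rows: Seq[Array[Double]]): Array[Double] =
    rows.reduceLeft((a, b) => addInPlace(a, b))

  // A fresh zero array is the accumulator, so the input data is untouched.
  def sumWithFold(rows: Seq[Array[Double]], dim: Int): Array[Double] =
    rows.foldLeft(new Array[Double](dim))((a, b) => addInPlace(a, b))
}
```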
2013-12-30 22:46:32 +08:00
Matei Zaharia b4ceed40d6 Merge remote-tracking branch 'origin/master' into conf2
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
	core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
	new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
	streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
	streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
2013-12-29 15:08:08 -05:00
Lian, Cheng f150b6e76c Response to Reynold's comments 2013-12-29 17:13:01 +08:00
Matei Zaharia 642029e7f4 Various fixes to configuration code
- Got rid of global SparkContext.globalConf
- Pass SparkConf to serializers and compression codecs
- Made SparkConf public instead of private[spark]
- Improved API of SparkContext and SparkConf
- Switched executor environment vars to be passed through SparkConf
- Fixed some places that were still using system properties
- Fixed some tests, though others are still failing

This still fails several tests in core, repl and streaming, likely due
to properties not being set or cleared correctly (some of the tests run
fine in isolation).
2013-12-28 17:13:15 -05:00
Reza Zadeh ae5102acc0 large scale considerations 2013-12-27 04:15:13 -05:00
Reza Zadeh 642ab5c1e1 initial large scale testing begin 2013-12-27 01:51:19 -05:00
Reza Zadeh 3369c2d487 cleanup documentation 2013-12-27 00:41:46 -05:00
Reza Zadeh bdb5037987 add all tests 2013-12-27 00:36:41 -05:00
Reza Zadeh fa1e8d8cbf test for truncated svd 2013-12-27 00:34:59 -05:00
Reza Zadeh 16de5268e3 full rank matrix test added 2013-12-26 23:21:57 -05:00
Lian, Cheng d7086dc28a Added Apache license header to NaiveBayesSuite 2013-12-27 08:20:41 +08:00
Reza Zadeh fe1a132d40 Main method added for svd 2013-12-26 18:13:21 -05:00
Reza Zadeh 1a21ba2967 new main file 2013-12-26 18:09:33 -05:00
Reza Zadeh 6c3674cd23 Object to hold the svd methods 2013-12-26 17:39:25 -05:00
Reza Zadeh 6e740cc901 Some documentation 2013-12-26 16:12:40 -05:00
Lian, Cheng 654f42174a Reformatted some lines commented by Matei 2013-12-27 04:45:04 +08:00
Reza Zadeh 1a173f00bd Initial files - no tests 2013-12-26 15:01:03 -05:00
Lian, Cheng c0337c5bbf Let reduceByKey to take care of local combine
Also refactored some heavy FP code to improve readability and reduce memory footprint.
2013-12-25 22:45:57 +08:00
Lian, Cheng 3bb714eaa3 Refactored NaiveBayes
* Minimized shuffle output with mapPartitions.
* Reduced RDD actions from 3 to 1.
2013-12-25 17:15:38 +08:00
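The shuffle reduction described in the commit above can be sketched in plain Python. This is an illustration only, not the NaiveBayes code from the commit: each partition is first collapsed to one (count, feature sum) pair per label, and only those small summaries are merged. The function names and the list-of-lists stand-in for partitions are hypothetical.

```python
def combine_partition(records):
    """Collapse one partition to {label: (count, per-feature sums)}."""
    acc = {}
    for label, features in records:
        count, sums = acc.get(label, (0, [0.0] * len(features)))
        acc[label] = (count + 1, [s + f for s, f in zip(sums, features)])
    return acc


def merge_summaries(left, right):
    """Merge two per-partition summaries; this is the only data that would be shuffled."""
    merged = dict(left)
    for label, (count, sums) in right.items():
        if label in merged:
            old_count, old_sums = merged[label]
            merged[label] = (old_count + count,
                             [a + b for a, b in zip(old_sums, sums)])
        else:
            merged[label] = (count, list(sums))
    return merged


def aggregate(partitions):
    """Single pass: summarize each partition locally, then merge the summaries."""
    result = {}
    for partition in partitions:
        result = merge_summaries(result, combine_partition(partition))
    return result
```

Calling `aggregate` on a list of partitions, each a list of `(label, features)` pairs, returns one summary per label regardless of how many records each partition held.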
Frank Dai 3dc655aa19 standard Naive Bayes classifier 2013-12-25 16:50:42 +08:00
Tor Myklebust 4e821390bc Scala stubs for updated Python bindings. 2013-12-25 00:09:00 -05:00
Tor Myklebust 58e2a7d6d4 Move PythonMLLibAPI into its own package. 2013-12-24 16:48:40 -05:00
Tor Myklebust 2402180b32 Fix error message ugliness. 2013-12-24 16:18:33 -05:00
Prashant Sharma 2573add94c spark-544, introducing SparkConf and related configuration overhaul. 2013-12-25 00:09:36 +05:30
Tor Myklebust 20f85eca3d Java stubs for ALSModel. 2013-12-21 14:54:13 -05:00
Tor Myklebust b454fdc2eb Javadocs; also, declare some things private. 2013-12-20 02:10:21 -05:00
Tor Myklebust b835ddf3df Licence notice. 2013-12-20 01:55:03 -05:00
Tor Myklebust f99970e8cd Scala classification and clustering stubs; matrix serialization/deserialization. 2013-12-20 00:12:22 -05:00
Tor Myklebust ded67ee90c Bindings for linear, Lasso, and ridge regression. 2013-12-19 22:42:12 -05:00
Tor Myklebust 2a41c9aad3 Un-semicolon PythonMLLibAPI. 2013-12-19 21:27:11 -05:00
Tor Myklebust 95915f8b3b First cut at python mllib bindings. Only LinearRegression is supported. 2013-12-19 01:29:09 -05:00
Mark Hamstra 09ed7ddfa0 Use scala.binary.version in POMs 2013-12-15 12:39:58 -08:00
Prashant Sharma 17db6a9041 Style fixes and addressed review comments at #221 2013-12-10 11:47:16 +05:30
Prashant Sharma 7ad6921ae0 Incorporated Patrick's feedback comment on #211 and made maven build/dep-resolution at least a bit faster. 2013-12-07 12:45:57 +05:30
Prashant Sharma 44fd30d3fb Merge branch 'master' into scala-2.10-wip
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	project/SparkBuild.scala
2013-11-25 18:10:54 +05:30
Marek Kolodziej 22724659db Make XORShiftRandom explicit in KMeans and roll it back for RDD 2013-11-20 07:03:36 -05:00
Marek Kolodziej 99cfe89c68 Updates to reflect pull request code review 2013-11-18 22:00:36 -05:00
Marek Kolodziej 09bdfe3b16 XORShift RNG with unit tests and benchmark
To run unit test, start SBT console and type:
compile
test-only org.apache.spark.util.XORShiftRandomSuite
To run benchmark, type:
project core
console
Once the Scala console starts, type:
org.apache.spark.util.XORShiftRandom.benchmark(100000000)
2013-11-18 15:21:43 -05:00
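For readers unfamiliar with the generator named in the commit above, here is a minimal sketch of one 64-bit xorshift step in Python. The shift triple (21, 35, 4) is a commonly cited choice and is an assumption here; the commit message does not say which constants XORShiftRandom uses. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, as a Long would


def xorshift64(state):
    """Advance a 64-bit xorshift state by one step (assumed triple 21, 35, 4)."""
    state ^= (state << 21) & MASK64
    state ^= state >> 35
    state ^= (state << 4) & MASK64
    return state


def xorshift_stream(seed, n):
    """Return the first n states produced from a nonzero seed."""
    state = seed & MASK64
    if state == 0:
        # Zero is a fixed point of xorshift, so it can never be used as a seed.
        raise ValueError("seed must be nonzero")
    values = []
    for _ in range(n):
        state = xorshift64(state)
        values.append(state)
    return values
```

The generator uses only shifts and XORs, which is why it is a candidate when raw speed matters more than statistical quality.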
Prashant Sharma 026ab75661 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 2013-10-10 09:42:55 +05:30
Prashant Sharma 26860639c5 Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	project/SparkBuild.scala
2013-10-10 09:42:23 +05:30
Prashant Sharma 7be75682b9 Merge branch 'master' into wip-merge-master
Conflicts:
	bagel/pom.xml
	core/pom.xml
	core/src/test/scala/org/apache/spark/ui/UISuite.scala
	examples/pom.xml
	mllib/pom.xml
	pom.xml
	project/SparkBuild.scala
	repl/pom.xml
	streaming/pom.xml
	tools/pom.xml

In Scala 2.10, a shorter representation is used for naming artifacts,
so changed to the shorter Scala version for artifacts and made it a property in the POM.
2013-10-08 11:29:40 +05:30
Nick Pentreath a5e58b8f98 Merge branch 'master' into implicit-als 2013-10-07 11:46:17 +02:00
Nick Pentreath b0f5f4d441 Bumping up test matrix size to eliminate random failures 2013-10-07 11:44:22 +02:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Martin Weindel e09f4a9601 fixed some warnings 2013-10-05 23:08:23 +02:00
Nick Pentreath c6ceaeae50 Style fix using 'if' rather than 'match' on boolean 2013-10-04 13:52:53 +02:00
Nick Pentreath 6a7836cddc Fixing closing brace indentation 2013-10-04 13:33:01 +02:00
Nick Pentreath 0bd9b373d1 Reverting to using comma-delimited split 2013-10-04 13:30:33 +02:00
Nick Pentreath 1cbdcb9cb6 Merge remote-tracking branch 'upstream/master' into implicit-als 2013-10-04 13:25:34 +02:00
Prashant Sharma 5829692885 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
	docs/_config.yml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2013-10-01 11:57:24 +05:30
Prashant Sharma 7ff4c2d399 fixed maven build for scala 2.10 2013-09-26 10:48:24 +05:30
Patrick Wendell 6079721fa1 Update build version in master 2013-09-24 11:41:51 -07:00
Nick Pentreath d952f04c8e Merge remote-tracking branch 'upstream/master' into implicit-als 2013-09-23 13:07:40 +02:00
Prashant Sharma 383e151fd7 Merge branch 'master' of git://github.com/mesos/spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	project/SparkBuild.scala
2013-09-15 10:55:12 +05:30
Matei Zaharia 7a5c4b647b Small tweaks to MLlib docs 2013-09-08 21:47:24 -07:00
Ameet Talwalkar 81a8bd46ac response to PR comments 2013-09-08 19:21:30 -07:00
Nick Pentreath 737f01a1ef Adding algorithm for implicit feedback data to ALS 2013-09-06 14:45:05 +02:00
Prashant Sharma 4106ae9fbf Merged with master 2013-09-06 17:53:01 +05:30
Matei Zaharia 12b2f1f9c9 Add missing license headers found with RAT 2013-09-02 12:23:03 -07:00
Matei Zaharia 0a8cc30921 Move some classes to more appropriate packages:
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
2013-09-01 14:13:16 -07:00
Matei Zaharia 5701eb92c7 Fix some URLs 2013-09-01 14:13:16 -07:00
Matei Zaharia 46eecd110a Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
Shivaram Venkataraman adc700582b Fix broken build by removing addIntercept 2013-08-30 00:16:32 -07:00
Evan Sparks 016787de32 Merge pull request #863 from shivaram/etrain-ridge
Adding linear regression and refactoring Ridge regression to use SGD
2013-08-29 22:15:14 -07:00
Evan Sparks 852d810787 Merge pull request #819 from shivaram/sgd-cleanup
Change SVM to use {0,1} labels
2013-08-29 22:13:15 -07:00
Shivaram Venkataraman dc06b52879 Add an option to turn off data validation, test it.
Also moves addIntercept to have default true to make it similar
to validateData option
2013-08-25 23:14:35 -07:00
Shivaram Venkataraman b8c50a0642 Center & scale variables in Ridge, Lasso.
Also add a unit test that checks if ridge regression lowers
cross-validation error.
2013-08-25 22:24:27 -07:00
Matei Zaharia 215c13dd41 Fix code style and a nondeterministic RDD issue in ALS 2013-08-22 16:13:46 -07:00
Matei Zaharia 46ea0c1b47 Merge pull request #814 from holdenk/master
Create fewer instances of the random class during ALS initialization.
2013-08-22 15:57:28 -07:00
Jey Kottalam 23f4622aff Remove redundant dependencies from POMs 2013-08-18 18:53:57 -07:00
Evan Sparks 07fe910669 Fixing typos in Java tests, and addressing alignment issues. 2013-08-18 15:03:13 -07:00
Evan Sparks b291db712e Centralizing linear data generator and mllib regression tests to use it. 2013-08-18 15:03:13 -07:00
Evan Sparks b659af83d3 Adding Linear Regression, and refactoring Ridge Regression. 2013-08-18 15:03:13 -07:00
Jey Kottalam ad580b94d5 Maven build now also works with YARN 2013-08-16 13:50:12 -07:00
Jey Kottalam 9dd15fe700 Don't mark hadoop-client as 'provided' 2013-08-16 13:50:12 -07:00
Jey Kottalam 11b42a84db Maven build now works with CDH hadoop-2.0.0-mr1 2013-08-16 13:50:12 -07:00
Jey Kottalam 353fab2440 Initial changes to make Maven build agnostic of hadoop version 2013-08-16 13:50:12 -07:00
Holden Karau 8fc40818d7 Fix 2013-08-15 23:08:48 -07:00
Shivaram Venkataraman c874625354 Specify label format in LogisticRegression. 2013-08-13 16:55:53 -07:00
Shivaram Venkataraman 0ab6ff4c32 Fix SVM model and unit test to work with {0,1}.
Also rename validateFuncs to validators.
2013-08-13 13:57:06 -07:00
Shivaram Venkataraman 654087194d Change SVM to use {0,1} labels.
Also add a data validation check to make sure classification labels
are always 0 or 1 and add an appropriate test case.
2013-08-13 11:44:47 -07:00
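The label check described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the validator from the commit: the function names are hypothetical, and the rescaling to {-1, +1} is shown only because hinge loss is conventionally written with signed labels; the commit does not say how the labels are used internally.

```python
def validate_binary_labels(labels):
    """Return True only if every label is exactly 0 or 1."""
    return all(y == 0 or y == 1 for y in labels)


def to_signed_label(y):
    """Map a {0, 1} label to {-1, +1} (assumed convention for hinge loss)."""
    if y != 0 and y != 1:
        raise ValueError("label must be 0 or 1, got %r" % (y,))
    return 2.0 * y - 1.0


def hinge_loss(weights, features, label):
    """Hinge loss max(0, 1 - y * w.x) for one example with a {0, 1} label."""
    margin = sum(w * x for w, x in zip(weights, features))
    return max(0.0, 1.0 - to_signed_label(label) * margin)
```

Running the validation before training lets bad input fail early with a clear message instead of silently producing a meaningless model.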
Holden Karau d145da818e Code review feedback :) 2013-08-12 22:13:08 -07:00
Holden Karau 705c9ace2a Use fewer instances of the random class during ALS setup 2013-08-12 22:08:36 -07:00
Matei Zaharia 9e02da2763 Merge pull request #812 from shivaram/maven-mllib-tests
Create SparkContext in beforeAll for MLLib tests
2013-08-12 20:22:27 -07:00
Shivaram Venkataraman 4935a2558b Clean up scaladoc in ML Lib.
Also build and copy ML Lib scaladoc in Spark docs build.
Some more minor cleanup with respect to naming, test locations etc.
2013-08-11 19:02:43 -07:00
Shivaram Venkataraman ecc9bfe377 Create SparkContext in beforeAll for MLLib tests
This overcomes test failures that occur using Maven
2013-08-11 17:04:00 -07:00
Evan Sparks ff9ebfabb4 Merge pull request #762 from shivaram/sgd-cleanup
Refactor SGD options into a new class.
2013-08-11 10:52:55 -07:00
Shivaram Venkataraman a65a6ed514 Fix GLM code review comments and move java tests 2013-08-10 18:54:10 -07:00
Matei Zaharia cd247ba5bb Merge pull request #786 from shivaram/mllib-java
Java fixes, tests and examples for ALS, KMeans
2013-08-09 20:41:13 -07:00
Reynold Xin 01f20a941e Fixed a typo in mllib inline documentation. 2013-08-08 16:42:54 -07:00
Shivaram Venkataraman 2812e72200 Add setters for optimizer, gradient in SGD.
Also remove java-specific constructor for LabeledPoint.
2013-08-08 16:24:31 -07:00
Shivaram Venkataraman e1a209f791 Remove Java-specific constructor for Rating.
The Scala constructor works for native Java types. Modify examples
to match this.
2013-08-08 14:36:02 -07:00
Shivaram Venkataraman 338b7a7455 Merge branch 'master' of git://github.com/mesos/spark into sgd-cleanup
Conflicts:
	mllib/src/main/scala/spark/mllib/util/MLUtils.scala
2013-08-06 21:21:55 -07:00
Shivaram Venkataraman 7db69d56f2 Refactor GLM algorithms and add Java tests
This change adds Java examples and unit tests for all GLM algorithms
to make sure the MLLib interface works from Java. Changes include
- Introduce LabeledPoint and avoid using Doubles in train arguments
- Rename train to run in class methods
- Make the optimizer a member variable of GLM to make sure the builder
  pattern works
2013-08-06 17:23:22 -07:00
Shivaram Venkataraman 6caec3f441 Add a test case for random initialization.
Also work around a bug where double[][] class cast fails
2013-08-06 16:35:47 -07:00
Shivaram Venkataraman 471fbadd0c Java examples, tests for KMeans and ALS
- Changes ALS to accept RDD[Rating] instead of (Int, Int, Double) making it
  easier to call from Java
- Renames class methods from `train` to `run` to enable static methods to be
  called from Java.
- Add unit tests which check if both static / class methods can be called.
- Also add examples which port the main() function in ALS, KMeans to the
  examples project.

Couple of minor changes to existing code:
- Add a toJavaRDD method in RDD to convert scala RDD to java RDD easily
- Work around a bug where using double[] from Java leads to a class cast exception in
  KMeans init
2013-08-06 15:43:46 -07:00
Ginger Smith bf7033f3eb fixing formatting, style, and input 2013-08-05 21:26:24 -07:00
Ginger Smith 8c8947e2b6 fixing formatting 2013-08-05 11:22:18 -07:00
Shivaram Venkataraman 7388e27668 Move implicit arg to constructor for Java access. 2013-08-03 18:08:43 -07:00
Ginger Smith 4ab4df5edb adding matrix factorization data generator 2013-08-02 22:22:36 -07:00
Shivaram Venkataraman 00339cc032 Refactor optimizers and create GLMs
This change refactors the structure of GLMs to use mixins which maintain
a similar interface to other ML lib algorithms. This change also creates
an Optimizer trait which allows GLMs to be extended to use other optimization
techniques.
2013-08-02 19:15:34 -07:00
Matei Zaharia abfa9e6f70 Increase Kryo buffer size in ALS since some arrays become big 2013-08-02 16:17:32 -07:00
shivaram 58756b72f1 Merge pull request #761 from mateiz/kmeans-generator
Add data generator for K-means
2013-07-31 23:45:41 -07:00
Matei Zaharia 52dba89261 Turn on caching in KMeans.main 2013-07-31 23:08:12 -07:00
Matei Zaharia b2b86c2575 Merge pull request #753 from shivaram/glm-refactor
Build changes for ML lib
2013-07-31 15:51:39 -07:00
Matei Zaharia f607ffb9e1 Added data generator for K-means
Also made it possible to specify the number of runs in KMeans.main().
2013-07-31 14:31:07 -07:00
Shivaram Venkataraman cef178873b Refactor SGD options into a new class.
This refactoring pulls out code shared between SVM, Lasso, LR into
a common GradientDescentOpts class. Some style cleanup as well
2013-07-31 14:15:17 -07:00
Matei Zaharia 9a444cffe7 Use the Char version of split() instead of the String one for efficiency 2013-07-31 11:28:39 -07:00
Shivaram Venkataraman 48851d4dd9 Add bagel, mllib to SBT assembly.
Also add jblas dependency to mllib pom.xml
2013-07-30 14:03:15 -07:00
Reynold Xin 366f7735eb Minor style cleanup of mllib. 2013-07-30 13:59:32 -07:00
Reynold Xin 47011e6854 Use a tighter bound in logistic regression unit test's prediction validation. 2013-07-30 13:58:23 -07:00
Reynold Xin e35966ae9a Renamed Classification.scala to ClassificationModel.scala and Regression.scala to RegressionModel.scala 2013-07-30 13:28:31 -07:00
Ameet Talwalkar e4387ddf5d made SimpleUpdater consistent with other updaters 2013-07-29 22:21:50 -07:00
Shivaram Venkataraman 3ca9faa341 Clarify how regVal is computed in Updater docs 2013-07-29 18:37:28 -07:00
Shivaram Venkataraman 07da72b451 Remove duplicate loss history and clarify why.
Also some minor style fixes.
2013-07-29 16:25:17 -07:00
Xinghao 2b2630ba3c Style fix
Lines shortened to < 100 characters
2013-07-29 09:22:49 -07:00
Xinghao 07f17439a5 Fix validatePrediction functions for Classification models
Classifiers return categorical (Int) values that should be compared
directly
2013-07-29 09:22:31 -07:00
Xinghao 3a8d07df8c Deleting extra LogisticRegressionGenerator and RidgeRegressionGenerator 2013-07-29 09:20:26 -07:00
Xinghao 75f3757300 Fix rounding error in LogisticRegression.scala 2013-07-29 09:19:56 -07:00
Xinghao c823ee1e2b Replace map-reduce with dot operator using DoubleMatrix 2013-07-28 22:17:53 -07:00
Xinghao 96e04f4cb7 Fixed SVM and LR train functions to take Int instead of Double for Classification 2013-07-28 22:12:39 -07:00
Xinghao 9398dced03 Changed Classification to return Int instead of Double
Also minor changes to formatting and comments
2013-07-28 21:39:19 -07:00
Xinghao 67de051bbb SVMSuite and LassoSuite rewritten to follow closely with LogisticRegressionSuite 2013-07-28 21:09:56 -07:00
Xinghao 29e042940a Move data generators to util 2013-07-28 20:39:52 -07:00
Xinghao ccfa362dde Change *_LocalRandomSGD to *LocalRandomSGD 2013-07-28 10:33:57 -07:00
Xinghao b0bbc7f6a8 Resolve conflicts with master, removed regParam for LogisticRegression 2013-07-26 18:57:39 -07:00
Xinghao 071afe2a33 New files from merge with master 2013-07-26 18:21:20 -07:00
Xinghao 10fd3949e6 Making ClassificationModel serializable 2013-07-26 17:49:11 -07:00
Xinghao f0a1f95228 Rename LogisticRegression, SVM and Lasso to *_LocalRandomSGD 2013-07-26 17:36:14 -07:00
Xinghao f74a03c6d8 Multiple changes
- Changed LogisticRegression regularization parameter to 0
- Removed println from SVM predict function
- Fixed "Lasso" -> "SVM" in SVMGenerator
- Added comment in Updater.scala to indicate L1 regularization leads to
soft thresholding proximal function
2013-07-26 17:29:44 -07:00
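The remark in the commit above about L1 regularization and soft thresholding can be illustrated with a short Python sketch. The proximal operator of t * |x| shrinks each coordinate toward zero by t and clips at zero. The function names and the plain-list weight representation are hypothetical; this is not the code in Updater.scala.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x| for a scalar: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def l1_proximal_step(weights, gradient, step_size, reg_param):
    """One proximal gradient step: plain gradient step, then soft thresholding."""
    shrinkage = step_size * reg_param
    return [soft_threshold(w - step_size * g, shrinkage)
            for w, g in zip(weights, gradient)]
```

Coordinates whose magnitude after the gradient step is below `step_size * reg_param` become exactly zero, which is how L1 regularization produces sparse weight vectors.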
Xinghao eef678703e Adding SVM and Lasso, moving LogisticRegression to classification from regression
Also, add regularization parameter to SGD
2013-07-24 15:32:50 -07:00
Reynold Xin 2210e8ccf8 Use a different validation dataset for Logistic Regression prediction testing. 2013-07-23 12:52:15 -07:00
Reynold Xin 87a9dd898f Made RegressionModel serializable and added unit tests to make sure predict methods would work. 2013-07-23 12:13:27 -07:00
Matei Zaharia c40f0f21f1 Merge pull request #711 from shivaram/ml-generators
Move ML lib data generator files to util/
2013-07-19 13:33:04 -07:00
Shivaram Venkataraman 2c9ea56db4 Rename classes to be called DataGenerator 2013-07-18 11:57:14 -07:00
Shivaram Venkataraman 7ab1170503 Refactor data generators to have a function that can be used in tests. 2013-07-18 11:55:19 -07:00
Shivaram Venkataraman 217667174e Return Array[Double] from SGD instead of DoubleMatrix 2013-07-17 16:08:34 -07:00
Shivaram Venkataraman 45f3c85518 Change weights to be Array[Double] in LR model.
Also ensure weights are initialized to a column vector.
2013-07-17 16:03:29 -07:00
Shivaram Venkataraman 3bf9897136 Rename loss -> stochasticLoss and add a note to explain why we have
multiple train methods.
2013-07-17 14:20:24 -07:00
Shivaram Venkataraman 64b88e039a Move ML lib data generator files to util/ 2013-07-17 14:11:44 -07:00
Shivaram Venkataraman 84fa20c2a1 Allow initial weight vectors in LogisticRegression.
Also move LogisticGradient to the LogisticRegression file and fix the
unit tests log path.
2013-07-17 14:04:05 -07:00
Matei Zaharia af3c9d5042 Add Apache license headers and LICENSE and NOTICE files 2013-07-16 17:21:33 -07:00
Matei Zaharia 4698a0d688 Shuffle ratings in a more efficient way at start of ALS 2013-07-15 02:54:11 +00:00
Matei Zaharia ed7fd501cf Make number of blocks in ALS configurable and lower the default 2013-07-15 00:30:10 +00:00
Matei Zaharia 931e4c96ef Fix a comment 2013-07-14 08:03:13 +00:00
Matei Zaharia c5c38d1987 Some optimizations to loading phase of ALS 2013-07-14 07:59:50 +00:00
Ameet Talwalkar bf4c9a5e0f renamed with labeled prefix 2013-07-08 14:37:42 -07:00
ryanlecompte be123aa6ef update to use ListBuffer, faster than Vector for append operations 2013-07-07 15:35:06 -07:00
ryanlecompte f78f8d0b41 fix formatting and use Vector instead of List to maintain order 2013-07-06 16:46:53 -07:00
ryanlecompte 757e56dfc7 make binSearch a tail-recursive method 2013-07-05 19:54:28 -07:00
Matei Zaharia 8bbe907556 Replaced string constants in test 2013-07-05 17:25:23 -07:00
Matei Zaharia 653043beb6 Renamed files to match package 2013-07-05 17:18:55 -07:00
Matei Zaharia de67deeaab Addressed style comments from Ryan LeCompte 2013-07-05 17:16:49 -07:00
Matei Zaharia 43b24635ee Renamed ML package to MLlib and added it to classpath 2013-07-05 11:38:53 -07:00