Commit graph

527 commits

Author SHA1 Message Date
Xiangrui Meng 84324fbcb9 [SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero
See inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test.
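
A minimal, hedged sketch of the scenario the fix targets (the class and methods are the real MLlib ones; the data is made up):

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val a = new MultivariateOnlineSummarizer()
a.add(Vectors.dense(1.0, 0.0))
a.add(Vectors.dense(3.0, 0.0))

// `b` has mean zero in the second dimension: the `other.mean` case in question.
val b = new MultivariateOnlineSummarizer()
b.add(Vectors.dense(5.0, 0.0))

// Merging mutates `a`; the result must match summarizing all three vectors at once.
a.merge(b)
println(a.mean) // expected: [3.0, 0.0]
```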

Author: Xiangrui Meng <meng@databricks.com>

Closes #3220 from mengxr/SPARK-4355 and squashes the following commits:

5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up
2014-11-12 01:50:11 -08:00
Manish Amde 2ef016b130 [MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time.
Before:
[info] GradientBoostingSuite:
[info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds)
[info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds)
[info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds)

After:
[info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds)
[info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds)
[info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds)

cc: mengxr, jkbradley

Author: Manish Amde <manish9ue@gmail.com>

Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits:

8994552 [Manish Amde] reducing gbt test run times
2014-11-11 22:47:53 -08:00
Michelangelo D'Agostino 7e9d975676 [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API
SPARK-1553 added alternating nonnegative least squares to MLlib; however, it is not possible to access it via the Python API.  This pull request resolves that.
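
For context, a hedged sketch of the underlying Scala setter this exposes (per SPARK-1553; `ratings` is an assumed input RDD):

```
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// ratings: RDD[Rating] of (user, product, rating) triples
def trainNonnegative(ratings: RDD[Rating]) =
  new ALS()
    .setRank(10)
    .setIterations(10)
    .setNonnegative(true) // constrain user/product factors to be nonnegative
    .run(ratings)
```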

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #3095 from mdagost/python_nmf and squashes the following commits:

a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI.  Remove the new static methods I added.  Set seed in tests.  Change ratings to ratingsRDD in both train and trainImplicit for consistency.
7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes.  Also made the python ALS tests more exact.
a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
2014-11-07 22:53:01 -08:00
Joseph K. Bradley 5b3b6f6f5f [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java
### Summary

* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* small cleanups and fixes

### Details

GradientBoosting bug fixes (“bug” = bad default options)
* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since it causes an error if persisted twice)

BoostingStrategy
* numEstimators: renamed to numIterations
* removed subsamplingRate (duplicated by Strategy)
* removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy

Strategy (for DecisionTree)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added setCategoricalFeaturesInfo method taking Java Map.
* Cleaned up assertValid
* Changed val’s to def’s since parameters can now be changed.
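
A minimal sketch of the construction path after these changes, using the names described above (a sketch only; `data` is an assumed `RDD[LabeledPoint]`, and the exact signatures may differ):

```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

def trainGbt(data: RDD[LabeledPoint]) = {
  // defaultParams now takes the algo as a simple String
  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 10 // renamed from numEstimators
  GradientBoostedTrees.train(data, boostingStrategy)
}
```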

CC: manishamde mengxr codedeft

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes
2014-11-05 10:33:13 -08:00
Davies Liu c8abddc516 [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is a Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or against the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is a matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```
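
The Python function mirrors `Statistics.chiSqTest` on the Scala side; a hedged sketch of the goodness-of-fit form there:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Goodness-of-fit test against the uniform distribution (the default).
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.statistic)
println(result.degreesOfFreedom)
println(result.pValue)
```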

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-04 21:35:52 -08:00
Niklas Wilcke f90ad5d426 [Spark-4060] [MLlib] exposing special rdd functions to the public
Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #2907 from numbnut/master and squashes the following commits:

7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907
2014-11-04 09:57:03 -08:00
Davies Liu e4f42631a6 [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default.
This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.

Author: Davies Liu <davies@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
2014-11-03 23:56:14 -08:00
Xiangrui Meng 1a9c6cddad [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
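
A rough sketch of what the attached examples do (assuming an existing SparkContext `sc` and a case-class RDD `points: RDD[LabeledPoint]`; details may differ from the attached code):

```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

// LabeledPoint is a case class and Vector is now a registered UDT,
// so the implicit Product-to-SchemaRDD conversion applies.
points.registerTempTable("points")
sqlContext.sql("SELECT label FROM points").collect()
points.saveAsParquetFile("points.parquet")
```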

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
2014-11-03 22:29:48 -08:00
Xiangrui Meng c5912ecc7b [FIX][MLLIB] fix seed in BaggedPointSuite
Saw Jenkins test failures due to random seeds.

jkbradley manishamde

Author: Xiangrui Meng <meng@databricks.com>

Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following commits:

f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite
2014-11-03 18:50:37 -08:00
Sung Chung 56f2c61cde [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees
jkbradley mengxr chouqin Please review this.

Author: Sung Chung <schung@alpinenow.com>

Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:

5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
2014-11-01 16:58:26 -07:00
Xiangrui Meng d8176b1c2f [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading
In #2928, we shaded commons-math3 to prevent future conflicts with hadoop. This caused problems with our Jenkins master build with maven: some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while mllib test code still compiles against core and the untouched math3 classes.

This PR sets commons-math3 version based on hadoop profiles.

pwendell JoshRosen srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:

580f6d9 [Xiangrui Meng] replace tab by spaces
7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
d3353d9 [Xiangrui Meng] do not shade commons-math3
b4180dc [Xiangrui Meng] temp work
2014-11-01 15:21:36 -07:00
freeman 98c556ebbc Streaming KMeans [MLLIB][SPARK-3254]
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors
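
A usage sketch of the algorithm described above (the streams `trainingData: DStream[Vector]` and `testData: DStream[(Long, Vector)]` are assumed inputs):

```
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

def run(trainingData: DStream[Vector], testData: DStream[(Long, Vector)]): Unit = {
  val model = new StreamingKMeans()
    .setK(2)
    .setDecayFactor(0.5)            // explicit decay factor...
    // .setHalfLife(5.0, "batches") // ...or "fractional decay" via a half-life
    .setRandomCenters(3, 0.0)
  model.trainOn(trainingData)
  model.predictOnValues(testData).print()
}
```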

tdas mengxr rezazadeh

Author: freeman <the.freeman.lab@gmail.com>
Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
2014-10-31 22:30:12 -07:00
Manish Amde 8602195510 [MLLIB] SPARK-1547: Add Gradient Boosting to MLlib
Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting, with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to the decision tree optimizations and random forests work.

Ideally, boosting algorithms should work with any base learners.  This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.

Here is the task list:
- [x] Gradient boosting support
- [x] Pluggable loss functions
- [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
- [x] Binary classification support
- [x] Support configurable checkpointing – This approach will avoid long lineage chains.
- [x] Create classification and regression APIs
- [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
- [x] Unit Tests

Future work:
+ Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
+ BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.

cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>

Closes #2607 from manishamde/gbt and squashes the following commits:

991c7b5 [Manish Amde] public api
ff2a796 [Manish Amde] addressing comments
b4c1318 [Manish Amde] removing spaces
8476b6b [Manish Amde] fixing line length
0183cb9 [Manish Amde] fixed naming and formatting issues
1c40c33 [Manish Amde] add newline, removed spaces
e33ab61 [Manish Amde] minor comment
eadbf09 [Manish Amde] parameter renaming
035a2ed [Manish Amde] jkbradley formatting suggestions
9f7359d [Manish Amde] simplified gbt logic and added more tests
49ba107 [Manish Amde] merged from master
eff21fe [Manish Amde] Added gradient boosting tests
3fd0528 [Manish Amde] moved helper methods to new class
a32a5ab [Manish Amde] added test for subsampling without replacement
781542a [Manish Amde] added support for fractional subsampling with replacement
3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to its own test suite
0e81906 [Manish Amde] improving caching unpersisting logic
d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
fee06d3 [Manish Amde] added weighted ensemble model
1b01943 [Manish Amde] add weights for base learners
9bc6e74 [Manish Amde] adding random seed as parameter
d2c8323 [Manish Amde] Merge branch 'master' into gbt
2ae97b7 [Manish Amde] added documentation for the loss classes
9366b8f [Manish Amde] minor: using numTrees instead of trees.size
3b43896 [Manish Amde] added learning rate for prediction
9b2e35e [Manish Amde] Merge branch 'master' into gbt
6a11c02 [manishamde] fixing formatting
823691b [Manish Amde] fixing RF test
1f47941 [Manish Amde] changing access modifier
5b67102 [Manish Amde] shortened parameter list
5ab3796 [Manish Amde] minor reformatting
9155a9d [Manish Amde] consolidated boosting configuration and added public API
631baea [Manish Amde] Merge branch 'master' into gbt
2cb1258 [Manish Amde] public API support
3b8ffc0 [Manish Amde] added documentation
8e10c63 [Manish Amde] modified unpersist strategy
f62bc48 [Manish Amde] added unpersist
bdca43a [Manish Amde] added timing parameters
2fbc9c7 [Manish Amde] fixing binomial classification prediction
6dd4dd8 [Manish Amde] added support for log loss
9af0231 [Manish Amde] classification attempt
62cc000 [Manish Amde] basic checkpointing
4784091 [Manish Amde] formatting
78ed452 [Manish Amde] added newline and fixed if statement
3973dd1 [Manish Amde] minor indicating subsample is double during comparison
aa8fae7 [Manish Amde] minor refactoring
1a8031c [Manish Amde] sampling with replacement
f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
cdceeef [Manish Amde] added documentation
6251fd5 [Manish Amde] modified method name
5538521 [Manish Amde] disable checkpointing for now
0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches
2014-10-31 18:57:55 -07:00
Alexander Ulanov 62d01d255c [MLLIB] SPARK-2329 Add multi-label evaluation metrics
Implementation of various multi-label classification measures, including: Hamming-loss, strict and default Accuracy, macro-averaged Precision, Recall and F1-measure based on documents and labels, and micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329

Multi-class measures are currently in the following pull request: https://github.com/apache/spark/pull/1155
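
A usage sketch of the new class (method names are assumptions based on the measures listed above; `predictionAndLabels` pairs each document's predicted label set with its true label set):

```
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD

def report(predictionAndLabels: RDD[(Array[Double], Array[Double])]): Unit = {
  val metrics = new MultilabelMetrics(predictionAndLabels)
  println(s"Hamming loss:    ${metrics.hammingLoss}")
  println(s"Accuracy:        ${metrics.accuracy}")
  println(s"Micro precision: ${metrics.microPrecision}")
  println(s"Micro recall:    ${metrics.microRecall}")
}
```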

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <nashb@yandex.ru>

Closes #1270 from avulanov/multilabelmetrics and squashes the following commits:

fc8175e [Alexander Ulanov] Merge with previous updates
43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array
517a594 [avulanov] Addressing reviewers comments: Scala style
cf4222bc [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels
1843f73 [Alexander Ulanov] Scala style fix
79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen
ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation
40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests
ad62df0 [Alexander Ulanov] Comments and scala style check
154164b [Alexander Ulanov] Multilabel evaluation metrics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class
2014-10-31 18:31:03 -07:00
Erik Erlandson ad3bd0dff8 [SPARK-3250] Implement Gap Sampling optimization for random sampling
More efficient sampling, based on Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
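
The idea, in a hedged sketch (the technique from the blog post, not the PR's code): rather than one Bernoulli draw per element, draw the gap to the next accepted element directly from a geometric distribution.

```
import scala.util.Random

// Sample each element with probability f by skipping ahead
// Geometric(f)-many elements between acceptances.
def gapSample[T](items: Iterator[T], f: Double, rng: Random): Iterator[T] =
  new Iterator[T] {
    private var cur: Option[T] = advance()
    private def advance(): Option[T] = {
      // skip count k has P(skip = k) = f * (1 - f)^k
      var skip = (math.log(rng.nextDouble()) / math.log(1.0 - f)).toInt
      while (skip > 0 && items.hasNext) { items.next(); skip -= 1 }
      if (items.hasNext) Some(items.next()) else None
    }
    def hasNext: Boolean = cur.isDefined
    def next(): T = { val out = cur.get; cur = advance(); out }
  }
```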

Author: Erik Erlandson <eerlands@redhat.com>

Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits:

72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling
2014-10-30 22:30:52 -07:00
Davies Liu 872fc669b4 [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API
Create several helper functions that call the MLlib Java API, converting arguments to Java types and return values to Python objects automatically; this greatly simplifies serialization in the MLlib Python API.

After this, the MLlib Python API no longer needs to deal with serialization details, which makes it easier to add new APIs.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2995 from davies/cleanup and squashes the following commits:

8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API
2014-10-30 22:25:18 -07:00
Yanbo Liang d9327192ee SPARK-4111 [MLlib] add regression metrics
Add RegressionMetrics.scala as regression metrics used for evaluation and corresponding test case RegressionMetricsSuite.scala.
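
A usage sketch (`predictionAndObservations` pairs each prediction with its observed label):

```
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD

def evaluate(predictionAndObservations: RDD[(Double, Double)]): Unit = {
  val metrics = new RegressionMetrics(predictionAndObservations)
  println(s"MSE:  ${metrics.meanSquaredError}")
  println(s"RMSE: ${metrics.rootMeanSquaredError}")
  println(s"MAE:  ${metrics.meanAbsoluteError}")
  println(s"R2:   ${metrics.r2}")
}
```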

Author: Yanbo Liang <yanbohappy@gmail.com>
Author: liangyanbo <liangyanbo@meituan.com>

Closes #2978 from yanbohappy/regression_metrics and squashes the following commits:

730d0a9 [Yanbo Liang] more clearly annotation
3d0bec1 [Yanbo Liang] rename and keep code style
a8ad3e3 [Yanbo Liang] simplify code for keeping style
d454909 [Yanbo Liang] rename parameter and function names, delete unused columns, add reference
2e56282 [liangyanbo] rename r2_score() and remove unused column
43bb12b [liangyanbo] add regression metrics
2014-10-30 12:00:56 -07:00
Joseph E. Gonzalez c7ad085208 [SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace
This simple patch filters out extra whitespace entries.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Joey <joseph.e.gonzalez@gmail.com>

Closes #2996 from jegonzal/loadLibSVM and squashes the following commits:

e0227ab [Joey] improving readability
e028e84 [Joseph E. Gonzalez] fixing whitespace bug in loadLibSVMFile when parsing libSVM files
2014-10-30 00:05:57 -07:00
DB Tsai 51ce997355 [SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer
In MultivariateOnlineSummarizer, breeze's activeIterator is used
to loop through the nonzero elements in the vector. However,
activeIterator doesn't perform well due to lots of overhead.
In this PR, a native while loop is used for both DenseVector and SparseVector.

The benchmark result with 20 executors using mnist8m dataset:
Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds

After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places,
the overall performance gain in the mllib library will be significant with this PR.
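
The pattern is simply manual iteration; an illustrative sketch (not the PR's code):

```
import org.apache.spark.mllib.linalg.DenseVector

// A plain while loop avoids the per-element overhead of
// breeze's activeIterator (iterator and tuple allocation).
def sumOfSquares(v: DenseVector): Double = {
  val values = v.values
  var sum = 0.0
  var i = 0
  while (i < values.length) {
    sum += values(i) * values(i)
    i += 1
  }
  sum
}
```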

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #2992 from dbtsai/SPARK-4129 and squashes the following commits:

b99db6c [DB Tsai] fixed java.lang.ArrayIndexOutOfBoundsException
2b5e882 [DB Tsai] small refactoring
ebe3e74 [DB Tsai] First commit
2014-10-29 10:14:53 -07:00
Davies Liu fae095bc7c [SPARK-3961] [MLlib] [PySpark] Python API for mllib.feature
Added the complete Python API for mllib.feature:

Normalizer
StandardScalerModel
StandardScaler
HashingTF
IDFModel
IDF
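
These wrap the existing Scala transformers; for reference, a sketch of the Scala side of the TF-IDF pieces (`documents` is an assumed `RDD[Seq[String]]`):

```
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def tfIdf(documents: RDD[Seq[String]]): RDD[Vector] = {
  val tf = new HashingTF().transform(documents)
  tf.cache() // IDF makes two passes: one to fit, one to transform
  new IDF().fit(tf).transform(tf)
}
```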

cc mengxr

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2819 from davies/feature and squashes the following commits:

4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
2014-10-28 03:50:22 -07:00
coderxiang 7e3a1ada86 [MLlib] SPARK-3987: add test case on objective value for NNLS
Also update step parameter to pass the proposed test

Author: coderxiang <shuoxiangpub@gmail.com>

Closes #2965 from coderxiang/nnls-test and squashes the following commits:

24b06f9 [coderxiang] add test case on objective value for NNLS; update step parameter to pass the test
2014-10-27 19:43:39 -07:00
Sean Owen bfa614b127 SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.

Author: Sean Owen <sowen@cloudera.com>

Closes #2928 from srowen/SPARK-4022 and squashes the following commits:

61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
2014-10-27 10:53:15 -07:00
Sean Owen df7974b8e5 SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8
This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors.

Author: Sean Owen <sowen@cloudera.com>

Closes #2909 from srowen/SPARK-3359 and squashes the following commits:

f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.
2014-10-25 23:18:02 -07:00
Kousuke Saruta f799700eec [SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib'
There are some inconsistent spellings, 'MLlib' and 'MLLib', in some documents and source code.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:

b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
2014-10-23 09:19:32 -07:00
coderxiang 814a9cd7fa SPARK-3568 [mllib] add ranking metrics
Add common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG

The following methods and the corresponding tests are implemented:

```
class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) {
  /** Returns the precision@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated precision
   * @return the average precision at the first k ranking positions
   */
  def precision(k: Int): Double

  /** Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /** Returns the mean average precision (MAP) of all the queries */
  lazy val meanAvePrec: Double

  /** Returns the normalized discounted cumulative gain for each query */
  lazy val ndcgAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated ndcg
   * @return the average ndcg at the first k ranking positions
   */
  def ndcg(k: Int): Double
}
```
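
A usage sketch with the method names from the snippet above (the merged names may differ; `predictionAndLabels` pairs each query's ranked predictions with its ground-truth set):

```
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

def summarize(predictionAndLabels: RDD[(Array[String], Array[String])]): Unit = {
  val metrics = new RankingMetrics(predictionAndLabels)
  println(metrics.precision(5)) // precision@5, averaged over queries
  println(metrics.ndcg(10))     // NDCG@10, averaged over queries
  println(metrics.meanAvePrec)  // mean average precision
}
```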

Author: coderxiang <shuoxiangpub@gmail.com>

Closes #2667 from coderxiang/rankingmetrics and squashes the following commits:

d881097 [coderxiang] update doc
14d9cd9 [coderxiang] remove unexpected files
d7fb93f [coderxiang] style change and remove ignored files
f113ee1 [coderxiang] modify doc for displaying superscript and subscript
f626896 [coderxiang] improve doc and remove unnecessary computation while labSet is empty
be6645e [coderxiang] set the precision of empty labset to 0.0
d64c120 [coderxiang] add logWarning for empty ground truth set
dfae292 [coderxiang] handle empty labSet for map. add test
62047c4 [coderxiang] style change and add documentation
f66612d [coderxiang] add additional test of precisionAt
b794cb2 [coderxiang] move private members precAtK, ndcgAtK into public methods. style change
77c9e5d [coderxiang] set precAtK and ndcgAtK as private member. Improve documentation
5f87bce [coderxiang] add API to calculate precision and ndcg at each ranking position
b7851cc [coderxiang] Use generic type to represent IDs
e443fee [coderxiang] change style and use alternative builtin methods
3a5a6ff [coderxiang] add ranking metrics
2014-10-21 15:45:47 -07:00
Michelangelo D'Agostino 1a623b2e16 SPARK-3770: Make userFeatures accessible from python
https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class to turn the RDD[(Int, Array[Double])] into an RDD[String]. This is then accessed from the Python recommendation.py.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #2636 from mdagost/mf_user_features and squashes the following commits:

c98f9e2 [Michelangelo D'Agostino] Added unit tests for userFeatures and productFeatures and merged master.
d5eadf8 [Michelangelo D'Agostino] Merge branch 'master' into mf_user_features
2481a2a [Michelangelo D'Agostino] Merged master and resolved conflict.
a6ffb96 [Michelangelo D'Agostino] Eliminated a function from our first approach to this problem that is no longer needed now that we added the fromTuple2RDD function.
2aa1bf8 [Michelangelo D'Agostino] Implemented a function called fromTuple2RDD in PythonMLLibAPI and used it to expose the MF userFeatures and productFeatures in python.
34cb2a2 [Michelangelo D'Agostino] A couple of lint cleanups and a comment.
cdd98e3 [Michelangelo D'Agostino] It's working now.
e1fbe5e [Michelangelo D'Agostino] Added scala function to stringify userFeatures for access in python.
2014-10-21 11:49:39 -07:00
Qiping Li eadc4c590e [SPARK-3207][MLLIB] Choose splits for continuous features in DecisionTree more adaptively
DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:

1. Sort sample values for this feature
2. Get number of occurrence of each distinct value
3. Iterate the value count array computed in step 2 to choose splits.

After finding the splits, `numSplits` and `numBins` in the metadata are updated.
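
A rough, self-contained sketch of the three steps (illustrative only, not the PR's code):

```
// sampled: feature values from the subsample; numSplits: target split count.
def chooseSplits(sampled: Array[Double], numSplits: Int): Array[Double] = {
  // Steps 1-2: sort and count occurrences of each distinct value.
  val valueCounts = sampled.groupBy(identity).toArray.sortBy(_._1)
    .map { case (v, xs) => (v, xs.length) }
  // Step 3: walk the counts, emitting a split about every `stride`
  // samples, so repeated values never yield duplicate splits.
  val stride = sampled.length.toDouble / (numSplits + 1)
  val splits = scala.collection.mutable.ArrayBuffer.empty[Double]
  var seen = 0.0
  valueCounts.dropRight(1).foreach { case (v, count) =>
    seen += count
    if (seen >= stride * (splits.length + 1)) splits += v
  }
  splits.toArray
}
```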

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <liqiping1991@gmail.com>
Author: chouqin <liqiping1991@gmail.com>
Author: liqi <liqiping1991@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2780 from chouqin/dt-findsplits and squashes the following commits:

18d0301 [Qiping Li] check explicitly findsplits return distinct splits
8dc28ab [chouqin] remove blank lines
ffc920f [chouqin] adjust code based on comments and add more test cases
9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
d353596 [qiping.lqp] fix pyspark doc test
9e64699 [Qiping Li] fix random forest unit test
3c72913 [Qiping Li] fix random forest unit test
092efcb [Qiping Li] fix bug
f69f47f [Qiping Li] fix bug
ab303a4 [Qiping Li] fix bug
af6dc97 [Qiping Li] fix bug
2a8267a [Qiping Li] fix bug
c339a61 [Qiping Li] fix bug
369f812 [Qiping Li] fix style
8f46af6 [Qiping Li] add comments and unit test
9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
0cd744a [liqi] fix bug
3652823 [Qiping Li] fix bug
af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
2014-10-20 13:12:26 -07:00
Joseph K. Bradley 477c6481cc [SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree
SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).

Added test to RandomForestSuite which failed with old version but now works.

SPARK-3918: Added baggedInput.unpersist at end of training.

Also:
* I removed DTStatsAggregator.isUnordered since it is no longer used.
* DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
* Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data.  This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
* RandomForestModel: Updated toString to print total number of nodes in forest.
* Changed Predict class to be public DeveloperApi.  This was necessary to allow users to create their own trees by hand (for testing).

CC: mengxr  manishamde chouqin codedeft  Just notifying you of these small bug fixes.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2785 from jkbradley/dtrunner-update and squashes the following commits:

9132321 [Joseph K. Bradley] merged with master, fixed imports
9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi.
f502e65 [Joseph K. Bradley] bug fix for SPARK-3934
7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data.
4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
2014-10-17 15:02:57 -07:00
Davies Liu 091d32c52e [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode
A customized pickler should be registered before unpickling, but in the executor there is no way to register the picklers before running the tasks.

So we need to register the picklers in the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib, and call SerDe.initialize() before pickling or unpickling.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2830 from davies/fix_pickle and squashes the following commits:

0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
2014-10-16 14:56:50 -07:00
Sean Owen 56096dbaa8 SPARK-3803 [MLLIB] ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
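
Illustrative of the overflow-avoidance pattern (a sketch, not the PR's code):

```
// The Gramian's upper triangle has n * (n + 1) / 2 entries; computing
// that product in Int overflows for large n, so promote to Long first
// and fail fast if the result cannot be allocated as an array.
def checkGramianSize(n: Int): Long = {
  val entries = n.toLong * (n.toLong + 1) / 2
  require(entries < Int.MaxValue,
    s"Cannot compute the Gramian of an $n-column matrix: $entries entries")
  entries
}
```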

Author: Sean Owen <sowen@cloudera.com>

Closes #2801 from srowen/SPARK-3803 and squashes the following commits:

b4e6d92 [Sean Owen] Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
2014-10-14 14:42:09 -07:00
omgteam 942847fd94 Bug Fix: without unpersist method in RandomForest.scala
While training Gradient Boosted Decision Trees on large-scale sparse data, Spark spilled large amounts of data to disk, and we found the bug below:
    In version 1.1.0's DecisionTree.scala train method, treeInput is persisted in memory but never unpersisted. This caused heavy disk usage.
    In the GitHub version (1.2.0, perhaps), the RandomForest.scala train method likewise persists baggedInput without ever unpersisting it.

After adding unpersist, it works correctly.
https://issues.apache.org/jira/browse/SPARK-3918

Author: omgteam <Kimlong.Liu@gmail.com>

Closes #2775 from omgteam/master and squashes the following commits:

815d543 [omgteam] adjust tab to spaces
1a36f83 [omgteam] Bug: fix without unpersist baggedInput in RandomForest.scala
2014-10-13 09:59:41 -07:00
Sean Owen 363baacade SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir
I noticed a few issues with how temp directories are created and deleted:

*Minor*

* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_

*Bit Less Minor*

* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called, even if the subdir is under an already-registered parent dir, since this check is inside the hook. However, `Utils` already manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`, so a single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.

I noticed a few other things that might be changed but wanted to ask first:

* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.

Author: Sean Owen <sowen@cloudera.com>

Closes #2670 from srowen/SPARK-3811 and squashes the following commits:

071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
2014-10-09 18:21:59 -07:00
GuoQiang Li 1e0aa4deba [Minor] use norm operator after breeze 0.10 upgrade
cc mengxr

Author: GuoQiang Li <witgo@qq.com>

Closes #2730 from witgo/SPARK-3856 and squashes the following commits:

2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
2014-10-09 09:22:32 -07:00
Qiping Li 14f222f7f7 [SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training
Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).

### Implementation Details

Each node now has an `impurity` field, and `predict` is changed from type `Double` to type `Predict` (this can be used to compute the predicted probability in the future). When computing the best splits for each node, we also compute impurity and predict for the child nodes, which are used to construct the newly allocated child nodes. So at level L, we have already set impurity and predict for the nodes at level L+1; if level L+1 is the last level, then we can avoid aggregation.

What's more, the calculation of parent impurity in top nodes for each tree needs to be treated differently, because we have to compute impurity and predict for them first. In `binsToBestSplit`, if the current node is a top node (level == 0), we calculate impurity and predict first; after finding the best split, the top node's predict and impurity are set to the calculated values. Non-top nodes' impurity and predict are already calculated and don't need to be recalculated. I considered adding an initialization step to set the top nodes' impurity and predict, so that all nodes could be treated the same way, but that would require a lot of code duplication (all the code for the seq operation (BinSeqOp) would need to be duplicated), so I chose the current approach.

 CC mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <liqiping1991@gmail.com>

Closes #2708 from chouqin/avoid-agg and squashes the following commits:

8e269ea [Qiping Li] adjust code and comments
eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
c41b1b6 [Qiping Li] fix pyspark unit test
7ad7a71 [Qiping Li] fix unit test
822c912 [Qiping Li] add comments and unit test
e41d715 [Qiping Li] fix bug in test suite
6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training
2014-10-09 01:36:58 -07:00
Xiangrui Meng 9c439d3316 [SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade
Got warning msg:

~~~
[warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm
[warn]     var norm = vector.toBreeze.norm(p)
~~~

dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #2718 from mengxr/SPARK-3856 and squashes the following commits:

4f38169 [Xiangrui Meng] use norm operator
2014-10-08 22:35:14 -07:00
DB Tsai b32bb72e81 [SPARK-3832][MLlib] Upgrade Breeze dependency to 0.10
In Breeze 0.10, the L1regParam can be configured through an anonymous function in OWLQN, so each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization.
2570911026

Author: DB Tsai <dbtsai@dbtsai.com>

Closes #2693 from dbtsai/breeze0.10 and squashes the following commits:

7a0c45c [DB Tsai] In Breeze 0.10, the L1regParam can be configured through anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. 2570911026
2014-10-07 16:47:24 -07:00
Liquan Pei 098c7344e6 [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec
mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string sequence both on python side and JVM side
(3) Test for SerDe of string sequence on JVM side
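
The Python wrapper sits on top of the existing Scala API; for reference, a sketch of that side (`sentences` is an assumed `RDD[Seq[String]]`):

```
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

def nearest(sentences: RDD[Seq[String]]): Array[(String, Double)] = {
  val model = new Word2Vec().fit(sentences)
  model.findSynonyms("spark", 5) // top-5 (word, cosine similarity) pairs
}
```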

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2356 from Ishiihara/Word2Vec-python and squashes the following commits:

476ea34 [Liquan Pei] style fixes
b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes
8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
daf88a6 [Liquan Pei] modification according to feedback
a73fa19 [Liquan Pei] clean up
3d8007b [Liquan Pei] fix findSynonyms for vector
1bdcd2e [Liquan Pei] minor fixes
cdef9f4 [Liquan Pei] add missing comments
b7447eb [Liquan Pei] modify according to feedback
b9a7383 [Liquan Pei] cache words RDD in fit
89490bf [Liquan Pei] add tests and Word2VecModelWrapper
78bbb53 [Liquan Pei] use pickle for seq string SerDe
a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
ca1e5ff [Liquan Pei] fix test
68e7276 [Liquan Pei] minor style fixes
48d5e72 [Liquan Pei] Functionality improvement
0ad3ac1 [Liquan Pei] minor fix
c867fdf [Liquan Pei] add Word2Vec to pyspark
2014-10-07 16:43:34 -07:00
Sandy Ryza 20ea54cc7a [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel
Add a toString method to GeneralizedLinearModel, and change `__str__` to `__repr__` for some classes to provide a better message in repr.

This PR is based on #1388, thanks to sryza!

closes #1388

Author: Sandy Ryza <sandy@cloudera.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2625 from davies/string and squashes the following commits:

3544aad [Davies Liu] fix LinearModel
0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark
1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places
aa9e962 [Sandy Ryza] Switch __str__ to __repr__
a0c5041 [Sandy Ryza] Add labels back in
1aa17f5 [Sandy Ryza] Match existing conventions
fac1bc4 [Sandy Ryza] Fix PEP8 error
f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
2014-10-06 14:05:45 -07:00
qiping.lqp 2e4eae3a52 [SPARK-3366][MLLIB] Compute best splits distributively in decision tree
Currently, all best splits are computed on the driver, which makes the driver a bottleneck for both communication and computation. This PR fixes the problem by computing the best splits on the executors.
Instead of sending all aggregate stats to the driver node, we send the aggregate stats for a node to a particular executor using a `reduceByKey` operation, and compute the best split for that node there.

Implementation details:

Each node now has a nodeStatsAggregator, which saves the aggregate stats for all features and bins.
First, use mapPartitions to compute the node aggregate stats for all nodes in each partition.
Then transform the node aggregate stats into (nodeIndex, nodeStatsAggregator) pairs and use a `reduceByKey` operation to combine the nodeStatsAggregators for the same node.
After all stats have been combined, the best split can be computed for each node from its node aggregate stats. The best-split results are collected to the driver to construct the decision tree.
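
The shape of that aggregation, with hypothetical stand-in types (`NodeStats`, `Split`, and `choose` are illustrations, not the PR's classes):

```
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class NodeStats(counts: Array[Double]) {
  def merge(o: NodeStats): NodeStats =
    NodeStats(counts.zip(o.counts).map { case (a, b) => a + b })
}
case class Split(feature: Int, threshold: Double)

def bestSplits(
    perPartitionStats: RDD[(Int, NodeStats)], // (nodeIndex, partial stats)
    choose: NodeStats => Split): Map[Int, Split] =
  perPartitionStats
    .reduceByKey(_ merge _) // combine stats for the same node on executors
    .mapValues(choose)      // best split also computed on executors
    .collect().toMap        // only the winning splits reach the driver
```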

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2595 from chouqin/dt-dist-agg and squashes the following commits:

db0d24a [chouqin] fix a minor bug and adjust code
a0d9de3 [chouqin] adjust code based on comments
9f201a6 [chouqin] fix bug: statsSize -> allStatsSize
a8a7ed0 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-dist-agg
f13b346 [chouqin] adjust randomforest comments
c32636e [chouqin] adjust code based on comments
ac6a505 [chouqin] adjust code based on comments
7bbb787 [chouqin] add comments
bdd2a63 [qiping.lqp] fix test suite
a75df27 [qiping.lqp] fix test suite
b5b0bc2 [qiping.lqp] fix style
e76414f [qiping.lqp] fix testsuite
748bd45 [qiping.lqp] fix type-mismatch bug
24eacd8 [qiping.lqp] fix type-mismatch bug
5f63d6c [qiping.lqp] add multiclassification using One-Vs-All strategy
4f56496 [qiping.lqp] fix bug
f00fc22 [qiping.lqp] fix bug
532993a [qiping.lqp] Compute best splits distributively in decision tree
2014-10-03 03:26:17 -07:00
Reynold Xin 3888ee2f38 [SPARK-3748] Log thread name in unit test logs
Thread names are useful for correlating failures.

Author: Reynold Xin <rxin@apache.org>

Closes #2600 from rxin/log4j and squashes the following commits:

83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
2014-10-01 01:03:49 -07:00
Joseph K. Bradley 7bf6cc9701 [SPARK-3751] [mllib] DecisionTree: example update + print options
DecisionTreeRunner functionality additions:
* Allow user to pass in a test dataset
* Do not print full model if the model is too large.

As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info.  Proposed updates:
* toString: prints model summary
* toDebugString: prints full model (named after RDD.toDebugString)

Similar update to Python API:
* __repr__() now prints a model summary
* toDebugString() now prints the full model
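
A sketch of the resulting Scala usage:

```
import org.apache.spark.mllib.tree.model.DecisionTreeModel

def describe(model: DecisionTreeModel): Unit = {
  println(model)               // toString: short model summary
  println(model.toDebugString) // full tree, named after RDD.toDebugString
}
```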

CC: mengxr  chouqin manishamde codedeft  Small update (whomever can take a look).  Thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2604 from jkbradley/dtrunner-update and squashes the following commits:

b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before
07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model
1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing.
22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.
2014-10-01 01:03:24 -07:00
Xiangrui Meng d75496b189 [SPARK-3701][MLLIB] update python linalg api and small fixes
1. doc updates
2. simple checks on vector dimensions
3. use column major for matrices

davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2548 from mengxr/mllib-py-clean and squashes the following commits:

6dce2df [Xiangrui Meng] address comments
116b5db [Xiangrui Meng] use np.dot instead of array.dot
75f2fcc [Xiangrui Meng] fix python style
fefce00 [Xiangrui Meng] better check of vector size with more tests
067ef71 [Xiangrui Meng] majored -> major
ef853f9 [Xiangrui Meng] update python linalg api and small fixes
2014-09-30 17:10:36 -07:00
Reza Zadeh 587a0cd7ed [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
# All-pairs similarity via DIMSUM
Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
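
A usage sketch of the two entry points (`mat` is an assumed `RowMatrix`):

```
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def similarities(mat: RowMatrix) = {
  val exact  = mat.columnSimilarities()    // brute force
  val approx = mat.columnSimilarities(0.1) // DIMSUM with threshold 0.1
  (exact, approx)
}
```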

Author: Reza Zadeh <rizlar@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:

404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
9fe17c0 [Xiangrui Meng] organize imports
2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
f2947e4 [Xiangrui Meng] some optimization
3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
0e4eda4 [Reza Zadeh] Use partition index for RNG
251bb9c [Reza Zadeh] Documentation
25e9d0d [Reza Zadeh] Line length for style
fb296f6 [Reza Zadeh] renamed to normL1 and normL2
3764983 [Reza Zadeh] Documentation
e9c6791 [Reza Zadeh] New interface and documentation
613f261 [Reza Zadeh] Column magnitude summary
75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
0f12ade [Reza Zadeh] Style changes
eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
41e8ece [Reza Zadeh] style changes
139c8e1 [Reza Zadeh] Syntax changes
029aa9c [Reza Zadeh] javadoc and new test
75edb25 [Reza Zadeh] All tests passing!
05e59b8 [Reza Zadeh] Add test
502ce52 [Reza Zadeh] new interface
654c4fb [Reza Zadeh] default methods
3726ca9 [Reza Zadeh] Remove MatrixAlgebra
6bebabb [Reza Zadeh] remove changes to MatrixSuite
5b8cd7d [Reza Zadeh] Initial files
2014-09-29 11:15:09 -07:00
Joseph K. Bradley 0dc2b6361d [SPARK-1545] [mllib] Add Random Forests
This PR adds RandomForest to MLlib.  The implementation is basic, and future performance optimizations will be important.  (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based on an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations).  Thank you for your inputs!

## Testing

Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets.

Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs).  Results below.

### Regression tests for DecisionTree

Summary: For training 1 tree, there are small regressions, especially from feature subsampling.

In the table below, each row is a single (random) dataset.  The 2 different sets of result columns are for 2 different RF implementations:
* (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once.  It does not include any code for feature subsampling.
* (feature subsets): This is from this current PR's code, after implementing feature subsampling.
These tests were to identify regressions in DecisionTree, so they are training 1 tree with all of the features (i.e., no feature subsampling).

These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels).  Speedup values < 1 indicate slowdowns from the old DecisionTree implementation.

numInstances | numFeatures | runtime, numTrees (sec) | speedup, numTrees | runtime, feature subsets (sec) | speedup, feature subsets
---- | ---- | ---- | ---- | ---- | ----
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599

### Tests for forests

I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees.

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed.  (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs.  (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel]
* Prediction:
 * For classification, most common label.  For regression, mean.
 * We could support other methods later.

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.

## DTStatsAggregator
* 2 types of functionality (with and without feature subsampling): These require different indexing methods.  (We could treat both as subsampling, but this is less efficient.)
  DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.

## Node
* Some vals changed to vars.
 * This is unfortunately a public API change (DeveloperApi).  This could be avoided by creating a LearningNode struct, but would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.

* RandomForest public API: What options should be easily supported by the train* methods?  Should ALL options be in the Java-friendly constructors?  Should there be a constructor taking Strategy?

* Feature subsampling options: What options should be supported?  scikit-learn supports the same options, except for "onethird."  One option would be to allow users to specify fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].

* Splits and bins are computed before bootstrapping, so all trees use the same discretization.

* One queue, instead of one queue per tree.
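
A training sketch using the train* entry points discussed above (parameter names per the 1.2-era API, so treat them as assumptions; `trainingData` is an assumed `RDD[LabeledPoint]`):

```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

def trainForest(trainingData: RDD[LabeledPoint]) =
  RandomForest.trainClassifier(
    trainingData,
    numClasses = 2,
    categoricalFeaturesInfo = Map.empty[Int, Int],
    numTrees = 10,
    featureSubsetStrategy = "sqrt", // default for classification per the commit list
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32,
    seed = 42)
```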

CC: mengxr manishamde codedeft chouqin  Please let me know if you have suggestions---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2435 from jkbradley/rfs-new and squashes the following commits:

c694174 [Joseph K. Bradley] Fixed typo
cc59d78 [Joseph K. Bradley] fixed imports
e25909f [Joseph K. Bradley] Simplified node group maps.  Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits
fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression.  Updated docs with references.
ef7c293 [Joseph K. Bradley] Updates based on code review.  Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite
593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features).  Now it checks >= 1.
a1a08df [Joseph K. Bradley] Removed old comments
866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds.
ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option
bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues.
d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest).  Fixed bugs so that RandomForest now runs.
746d43c [Joseph K. Bradley] Implemented feature subsampling.  Tested DecisionTree but not RandomForest.
6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new.  Added RandomForestModel.toString
b7ae594 [Joseph K. Bradley] Updated docs.  Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes.
121c74e [Joseph K. Bradley] Basic random forests are implemented.  Random features per node not yet implemented.  Test suite not implemented.
325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new
4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet
eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-28 21:44:50 -07:00
RJ Nowling ec9df6a765 [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.

This is implemented using a minimumOccurence parameter (default 0).  When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0.  As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.

This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0, to maintain backwards compatibility with the original IDF API.
* Set the IDFs to 0 for terms whose DFs are less than minimumOccurence
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
* Update the MLlib Feature Extraction programming guide to describe the new feature
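
With the parameter renamed to minDocFreq (per the commits below), usage would look roughly like:

~~~
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// tf: RDD[Vector] of term-frequency vectors, assumed already computed.
// Terms appearing in fewer than 2 documents get an IDF of 0, so their
// TF-IDF values are 0, as if they were absent from the corpus.
def tfidf(tf: RDD[Vector]): RDD[Vector] =
  new IDF(minDocFreq = 2).fit(tf).transform(tf)
~~~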

Author: RJ Nowling <rnowling@gmail.com>

Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:

0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
2014-09-26 09:58:47 -07:00
Aaron Staple ff637c9380 [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm’s current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.

I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, the warning would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached, if the warning were in GradientDescent or other optimizer.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent will be knowledgable enough to cache their data without needing a log warning, so lack of a warning in the optimizers may be ok.

Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.
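
For example, the pattern the updated examples follow (file path and parameter values illustrative):

~~~
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// sc is an existing SparkContext.  Cache the training data before handing
// it to an iterative algorithm; otherwise every KMeans iteration re-reads
// and re-parses the input.
val data = sc.textFile("data/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(data, 2, 20)  // k = 2, maxIterations = 20
~~~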

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2347 from staple/SPARK-1484 and squashes the following commits:

bd49701 [Aaron Staple] Address review comments.
ab2d4a4 [Aaron Staple] Disable warnings on python code path.
a7a0f99 [Aaron Staple] Change code comments per review comments.
7cca1dc [Aaron Staple] Change warning message text.
c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
3b6c511 [Aaron Staple] Minor doc example fixes.
2014-09-25 16:11:00 -07:00
Davies Liu fce5e251d6 [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib
Currently, we serialize the data between the JVM and Python case by case manually; this cannot scale to support so many APIs in MLlib.

This patch addresses the problem by serializing the data using the pickle protocol, with the Pyrolite library handling serialization/deserialization in the JVM. The pickle protocol can easily be extended to support custom classes.

All the modules are refactored to use this protocol.
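
On the JVM side, the round trip through Pyrolite looks roughly like this (a simplified sketch; the real code also registers custom picklers for MLlib types such as vectors and ratings):

~~~
import net.razorvine.pickle.{Pickler, Unpickler}

// Serialize a JVM object to pickle bytes that Python can load natively,
// and deserialize pickle bytes coming back from Python.
val bytes: Array[Byte] = new Pickler().dumps(java.util.Arrays.asList(1.0, 2.0, 3.0))
val obj: AnyRef = new Unpickler().loads(bytes)
~~~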

Known issues: there will be some performance regression (both CPU and memory, since the serialized data is larger).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2378 from davies/pickle_mllib and squashes the following commits:

dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
810f97f [Davies Liu] fix equal of matrix
032cd62 [Davies Liu] add more type check and conversion for user_product
bd738ab [Davies Liu] address comments
e431377 [Davies Liu] fix cache of rdd, refactor
19d0967 [Davies Liu] refactor Picklers
2511e76 [Davies Liu] cleanup
1fccf1a [Davies Liu] address comments
a2cc855 [Davies Liu] fix tests
9ceff73 [Davies Liu] test size of serialized Rating
44e0551 [Davies Liu] fix cache
a379a81 [Davies Liu] fix pickle array in python2.7
df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
154d141 [Davies Liu] fix autobatchedpickler
44736d7 [Davies Liu] speed up pickling array in Python 2.7
e1d1bfc [Davies Liu] refactor
708dc02 [Davies Liu] fix tests
9dcfb63 [Davies Liu] fix style
88034f0 [Davies Liu] rafactor, address comments
46a501e [Davies Liu] choose batch size automatically
df19464 [Davies Liu] memorize the module and class name during pickleing
f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
722dd96 [Davies Liu] cleanup _common.py
0ee1525 [Davies Liu] remove outdated tests
b02e34f [Davies Liu] remove _common.py
84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
4d7963e [Davies Liu] remove muanlly serialization
6d26b03 [Davies Liu] fix tests
c383544 [Davies Liu] classification
f2a0856 [Davies Liu] mllib/regression
d9f691f [Davies Liu] mllib/util
cccb8b1 [Davies Liu] mllib/tree
8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
aa2287e [Davies Liu] random
f1544c4 [Davies Liu] refactor clustering
52d1350 [Davies Liu] use new protocol in mllib/stat
b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-19 15:01:11 -07:00
Burak e76ef5cb8e [SPARK-3418] Sparse Matrix support (CCS) and additional native BLAS operations added
Local `SparseMatrix` support added in Compressed Column Storage (CCS) format in addition to Level-2 and Level-3 BLAS operations such as dgemv and dgemm respectively.

BLAS doesn't support sparse matrix operations, so implementations of `SparseMatrix`-`DenseMatrix` and `SparseMatrix`-`DenseVector` multiplication have been added. I will post performance comparisons in the comments momentarily.
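
To illustrate the CCS layout (constructor shape as added for the local `SparseMatrix`):

~~~
import org.apache.spark.mllib.linalg.SparseMatrix

// The 3x2 matrix [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]] in compressed column
// storage: colPtrs marks where each column's entries start in values.
val sm = new SparseMatrix(3, 2,
  colPtrs = Array(0, 2, 3),     // column 0 -> values[0, 2), column 1 -> values[2, 3)
  rowIndices = Array(0, 2, 1),  // row of each stored value
  values = Array(1.0, 3.0, 2.0))
~~~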

Author: Burak <brkyvz@gmail.com>

Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:

88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
2014-09-18 22:18:51 -07:00
qiping.lqp fdb302f49c [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)

Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes

CC: mengxr

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:

61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-15 17:43:26 -07:00
Reza Zadeh 983d6a9c48 [MLlib] Update SVD documentation in IndexedRowMatrix
Updating this to reflect the newest SVD via ARPACK

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2389 from rezazadeh/irmdocs and squashes the following commits:

7fa1313 [Reza Zadeh] Update svd docs
715da25 [Reza Zadeh] Updated computeSVD documentation IndexedRowMatrix
2014-09-15 17:41:15 -07:00
Christoph Sawade 3b93128139 [SPARK-3396][MLLIB] Use SquaredL2Updater in LogisticRegressionWithSGD
SimpleUpdater ignores the regularizer, which leads to an unregularized
LogReg. To enable the common L2 regularizer (and the corresponding
regularization parameter) for logistic regression, the SquaredL2Updater
has to be used in SGD (see, e.g., [SVMWithSGD]).
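
For example (a sketch; the parameter values and training data are illustrative):

~~~
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

// With SquaredL2Updater set on the optimizer, regParam actually takes
// effect; with SimpleUpdater it was silently ignored.
val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setUpdater(new SquaredL2Updater)
  .setRegParam(0.1)
  .setNumIterations(200)
val model = lr.run(trainingData)  // trainingData: RDD[LabeledPoint], assumed
~~~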

Author: Christoph Sawade <christoph@sawade.me>

Closes #2398 from BigCrunsh/fix-regparam-logreg and squashes the following commits:

0820c04 [Christoph Sawade] Use SquaredL2Updater in LogisticRegressionWithSGD
2014-09-15 17:39:31 -07:00
Joseph K. Bradley b8634df1f1 [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix.
This PR includes some code simplifications and re-organization which will be helpful for implementing random forests.  The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.

Also added 2 bug fixes:
* maxMemoryUsage calculation
* over-allocation of space for bins in DTStatsAggregator for unordered features.

Relation to RFs:
* Since RFs will be deeper and therefore more likely to be sparse (not full trees), avoiding pre-allocation of a full tree could be a cost savings.
* The associated re-organization also reduces bookkeeping, which will make RFs easier to implement.
* The return code doneTraining may be generalized to include cases such as nodes ready for local training.

Details:

No longer pre-allocate parentImpurities array in main train() method.
* parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
* These were not really needed.  They were used in calculateGainForSplit(), but they can be calculated anyway using parentNodeAgg.

No longer using Node.build since tree structure is constructed on-the-fly.
* Did not eliminate since it is public (Developer) API.  Marked as deprecated.

Eliminated pre-allocated nodes array in main train() method.
* Nodes are constructed and added to the tree structure as needed during training.
* Moved tree construction from main train() method into findBestSplitsPerGroup() since there is no need to keep the (split, gain) array for an entire level of nodes.  Only one element of that array is needed at a time, so we do not need the array.

findBestSplits() now returns 2 items:
* rootNode (newly created root node on first iteration, same root node on later iterations)
* doneTraining (indicating if all nodes at that level were leafs)

Updated DecisionTreeSuite.  Notes:
* Improved test "Second level node building with vs. without groups"
** generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
* Related update: Added Node.deepCopy (private[tree]), used for test suite

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2341 from jkbradley/dt-spark-3160 and squashes the following commits:

07dd1ee [Joseph K. Bradley] Fixed overflow bug with computing maxMemoryUsage in DecisionTree.  Also fixed bug with over-allocating space in DTStatsAggregator for unordered features.
debe072 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
2014-09-12 01:37:59 -07:00
qiping.lqp 79cdb9b64a [SPARK-2207][SPARK-3272][MLLib]Add minimum information gain and minimum instances per node as training parameters for decision tree.
These two parameters can act as early stopping rules for pre-pruning. When a split would cause the left or right child to have fewer than `minInstancesPerNode` instances, or would have less information gain than `minInfoGain`, the current node will not be split by that split.

When there is no possible splits that satisfy requirements, there is no useful information gain stats, but we still need to calculate the predict value for current node. So I separated calculation of predict from calculation of information gain, which can also save computation when the number of possible splits is large. Please see [SPARK-3272](https://issues.apache.org/jira/browse/SPARK-3272) for more details.
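
For example, a sketch using named parameters (training data assumed; other settings left at their defaults):

~~~
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

// Pre-pruning: reject any split that would give a child fewer than 5
// instances or that gains less than 0.001 in information.
val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 5,
  numClassesForClassification = 2,
  minInstancesPerNode = 5,
  minInfoGain = 0.001)
val model = DecisionTree.train(trainingData, strategy)  // trainingData assumed
~~~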

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2332 from chouqin/dt-preprune and squashes the following commits:

f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-10 15:37:10 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
Joseph K. Bradley 711356b422 [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements
Summary:
1. Variable numBins for each feature [SPARK-3043]
2. Reduced data reshaping in aggregation [SPARK-3043]
3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
4. Changed nodes to use 1-indexing [SPARK-3086]
5. Small clean-ups

Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).

Speedups: This update helps most when many features use few bins but a few features use many bins.  Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
* Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
* Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec

Details:

(1) Variable numBins for each feature [SPARK-3043]

DecisionTreeMetadata now computes a variable numBins for each feature.  It also tracks numSplits.

(2) Reduced data reshaping in aggregation [SPARK-3043]

Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
* Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
* Design note: I originally tried creating Impurity classes which stored data and storing the aggregates in an Array[Array[Array[Impurity]]].  However, this led to significant slowdowns, perhaps because of overhead in creating so many objects.

The aggregate statistics are never reshaped, and cumulative sums are computed in-place.

Updated the layout of aggregation functions.  The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
For this update, the following functions were refactored:
* updateBinForOrderedFeature
* updateBinForUnorderedFeature
* binaryOrNotCategoricalBinSeqOp
* multiclassWithCategoricalBinSeqOp
* regressionBinSeqOp
The above 5 functions were replaced with:
* orderedBinSeqOp
* someUnorderedBinSeqOp

Other changes:
* calculateGainForSplit now treats all feature types the same way.
* Eliminated extractLeftRightNodeAggregates.

(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]

Updated binsToBestSplit():
* This now computes cumulative sums of stats for ordered features.
* For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins.)
* Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].

Side effects:
* In findSplitsBins: A sample of the data is only taken for data with continuous features.  It is not needed for data with only categorical features.
* In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
* TreePoint binning is simpler for categorical features.

(4) Changed nodes to use 1-indexing [SPARK-3086]

Nodes used to be indexed from 0.  Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).
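
With 1-indexing, the structural arithmetic reduces to bit operations, e.g. (a sketch of the kind of helpers collected in object Node):

~~~
// Binary-heap style indexing: root is 1, children of node i are 2i and
// 2i + 1, and the parent of node i is i / 2.
def leftChildIndex(nodeIndex: Int): Int = nodeIndex << 1
def rightChildIndex(nodeIndex: Int): Int = (nodeIndex << 1) + 1
def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1
def isLeftChild(nodeIndex: Int): Boolean = nodeIndex > 1 && nodeIndex % 2 == 0
~~~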

(5) Small clean-ups

Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.

CC: mengxr  manishamde  Please let me know if you have thoughts on this—thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2125 from jkbradley/dt-opt3alt and squashes the following commits:

42c192a [Joseph K. Bradley] Merge branch 'rfs' into dt-opt3alt
d3cc46b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
00e4404 [Joseph K. Bradley] optimization for TreePoint construction (pre-computing featureArity and isUnordered as arrays)
425716c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs
a2acea5 [Joseph K. Bradley] Small optimizations based on profiling
aa4e4df [Joseph K. Bradley] Updated DTStatsAggregator with bug fix (nodeString should not be multiplied by statsSize)
4651154 [Joseph K. Bradley] Changed numBins semantics for unordered features. * Before: numBins = numSplits = (1 << k - 1) - 1 * Now: numBins = 2 * numSplits = 2 * [(1 << k - 1) - 1] * This also involved changing the semantics of: ** DecisionTreeMetadata.numUnorderedBins()
1e3b1c7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
1485fcc [Joseph K. Bradley] Made some DecisionTree methods private.
92f934f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
e676da1 [Joseph K. Bradley] Updated documentation for DecisionTree
37ca845 [Joseph K. Bradley] Fixed problem with how DecisionTree handles ordered categorical features.
105f8ab [Joseph K. Bradley] Removed commented-out getEmptyBinAggregates from DecisionTree
062c31d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
6d32ccd [Joseph K. Bradley] In DecisionTree.binsToBestSplit, changed loops to iterators to shorten code.
807cd00 [Joseph K. Bradley] Finished DTStatsAggregator, a wrapper around the aggregate statistics for easy but hopefully efficient indexing.  Modified old ImpurityAggregator classes and renamed them ImpurityCalculator; added ImpurityAggregator classes which work with DTStatsAggregator but do not store data.  Unit tests all succeed.
f2166fd [Joseph K. Bradley] still working on DTStatsAggregator
92f7118 [Joseph K. Bradley] Added partly written DTStatsAggregator
fd8df30 [Joseph K. Bradley] Moved some aggregation helpers outside of findBestSplitsPerGroup
d7c53ee [Joseph K. Bradley] Added more doc for ImpurityAggregator
a40f8f1 [Joseph K. Bradley] Changed nodes to be indexed from 1.  Tests work.
95cad7c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
5f94342 [Joseph K. Bradley] Added treeAggregate since not yet merged from master.  Moved node indexing functions to Node.
61c4509 [Joseph K. Bradley] Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned up DT Suite some.
3ba7166 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
b314659 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
9c83363 [Joseph K. Bradley] partial merge but not done yet
45f7ea7 [Joseph K. Bradley] partial merge, not yet done
5fce635 [Joseph K. Bradley] Merge branch 'dt-opt2' into dt-opt3
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
dd4d3aa [Joseph K. Bradley] Mid-process in bug fix: bug for binary classification with categorical features * Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits. * (Temp) Fix: Treat low-arity categorical features as unordered for binary classification. * Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features. * Status: The bug may be fixed, but more testing needs to be done.
438a660 [Joseph K. Bradley] removed subsampling for mnist8m from DT
86e217f [Joseph K. Bradley] added cache to DT input
e3c84cc [Joseph K. Bradley] Added stuff fro mnist8m to D T Runner
51ef781 [Joseph K. Bradley] Fixed bug introduced by last commit: Variance impurity calculation was incorrect since counts were swapped accidentally
fd65372 [Joseph K. Bradley] Major changes: * Created ImpurityAggregator classes, rather than old aggregates. * Feature split/bin semantics are based on ordered vs. unordered ** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features. * numBins can differ for each feature
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-09-08 09:47:13 -07:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
Kousuke Saruta 1bed0a3869 [SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to multi-byte character contained in Gradient.scala
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2248 from sarutak/SPARK-3372 and squashes the following commits:

73a28b8 [Kousuke Saruta] Replaced UTF-8 hyphen with ascii hyphen
2014-09-03 20:47:00 -07:00
Reza Zadeh 0f16b23cd1 [MLlib] Squash bug in IndexedRowMatrix
Kill this bug fast before it does damage.

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2224 from rezazadeh/indexrmbug and squashes the following commits:

53386d6 [Reza Zadeh] Squash bug in IndexedRowMatrix
2014-09-02 09:48:05 -07:00
Xiangrui Meng 220f413686 [SPARK-2495][MLLIB] make KMeans constructor public
to re-construct k-means models freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #2112 from mengxr/public-constructors and squashes the following commits:

18d53a9 [Xiangrui Meng] make KMeans constructor public
2014-08-25 12:30:02 -07:00
Xiangrui Meng 0a984aa155 [SPARK-3142][MLLIB] output shuffle data directly in Word2Vec
Sorry I didn't realize this in #2043. Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2049 from mengxr/more-w2v and squashes the following commits:

050b1c5 [Xiangrui Meng] output shuffle data directly
2014-08-19 22:16:22 -07:00
Xiangrui Meng fce5c0fb63 [HOTFIX][Streaming][MLlib] use temp folder for checkpoint
or Jenkins will complain about no Apache header in checkpoint files. tdas rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #2046 from mengxr/tmp-checkpoint and squashes the following commits:

0d3ec73 [Xiangrui Meng] remove ssc.stop
9797843 [Xiangrui Meng] change checkpointDir to lazy val
89964ab [Xiangrui Meng] use temp folder for checkpoint
2014-08-19 22:05:29 -07:00
Xiangrui Meng 068b6fe6a1 [SPARK-3130][MLLIB] detect negative values in naive Bayes
because NB treats feature values as term frequencies. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2038 from mengxr/nb-neg and squashes the following commits:

52c37c3 [Xiangrui Meng] address comments
65f892d [Xiangrui Meng] detect negative values in nb
2014-08-19 21:01:23 -07:00
Xiangrui Meng 1870dbaa55 [MLLIB] minor update to word2vec
very minor update Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2043 from mengxr/minor-w2v and squashes the following commits:

be649fd [Xiangrui Meng] remove map because we only need append
eccefcc [Xiangrui Meng] minor updates to word2vec
2014-08-19 17:41:37 -07:00
Xiangrui Meng 825d4fe47b [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs
Though we don't use default arguments for methods in RandomRDDs, they are still not easy for Java users to call because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz
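
For example (a sketch; sc is an existing SparkContext):

~~~
import org.apache.spark.mllib.random.RandomRDDs

// Scala users get default arguments for numPartitions and seed...
val x = RandomRDDs.normalRDD(sc, 1000000L)  // RDD[Double]
// ...while Java users get dedicated wrappers returning Java RDD types,
// e.g. RandomRDDs.normalJavaRDD(jsc, 1000000L) yields a JavaDoubleRDD.
~~~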

Author: Xiangrui Meng <meng@databricks.com>

Closes #2041 from mengxr/stat-doc and squashes the following commits:

fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
2014-08-19 16:06:48 -07:00
freeman 31f0b071ef [SPARK-3128][MLLIB] Use streaming test suite for StreamingLR
Refactored tests for streaming linear regression to use existing  streaming test utilities. Summary of changes:
- Made ``mllib`` depend on tests from ``streaming``
- Rewrote accuracy and convergence tests to use ``setupStreams`` and ``runStreams``
- Added new test for the accuracy of predictions generated by ``predictOnValue``

These tests should run faster, be easier to extend/maintain, and provide a reference for new tests.

mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #2037 from freeman-lab/streamingLR-predict-tests and squashes the following commits:

e851ca7 [freeman] Fixed long lines
50eb0bf [freeman] Refactored tests to use streaming test tools
32c43c2 [freeman] Added test for prediction
2014-08-19 13:28:57 -07:00
Xiangrui Meng 217b5e915e [SPARK-3108][MLLIB] add predictOnValues to StreamingLR and fix predictOn
It is useful in streaming to allow users to carry extra data with the prediction, for monitoring the prediction error for example. freeman-lab
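
A sketch of the intended use (the stream names and numFeatures are assumptions):

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream)  // trainingStream: DStream[LabeledPoint], assumed

// Key each test point by its true label, so the output stream of
// (label, prediction) pairs can be used to monitor prediction error.
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
~~~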

Author: Xiangrui Meng <meng@databricks.com>

Closes #2023 from mengxr/predict-on-values and squashes the following commits:

cac47b8 [Xiangrui Meng] add classtag
2821b3b [Xiangrui Meng] use mapValues
0925efa [Xiangrui Meng] add predictOnValues to StreamingLR and fix predictOn
2014-08-18 18:20:54 -07:00
Joseph K. Bradley c8b16ca0d8 [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 18:01:39 -07:00
Joseph K. Bradley 115eeb30dd [mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:

b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-18 14:40:05 -07:00
Liquan Pei 9306b8c6c8 [MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
mengxr
Remove  transform(dataset: RDD[String]) from public API.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:

17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
2014-08-18 01:15:45 -07:00
Liquan Pei 3c8fa50590 [SPARK-3097][MLlib] Word2Vec performance improvement
mengxr Please review the code. Adding weights in reduceByKey soon.

Only output model entries for words that appeared in the partition before merging, and use reduceByKey to combine models. In general, this implementation is 30s or so faster than the implementation using a big array.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following commits:

d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models
2014-08-17 23:29:44 -07:00
Xiangrui Meng c77f40668f [SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for number of distinct values
There is a bug determining the column index. dorx

Author: Xiangrui Meng <meng@databricks.com>

Closes #1997 from mengxr/chisq-index and squashes the following commits:

8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of distinct values
2014-08-17 20:53:18 -07:00
Joseph K. Bradley 73ab7f141c [SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
DecisionTree needs to match each example to a node at each iteration.  It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node.

Fix: Filter top-down using the partly built tree itself.
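
For illustration, a minimal top-down routing sketch (hypothetical simplified node type, not the actual Node class):

~~~
// Hypothetical simplified node structure, for illustration only.
case class SimpleNode(id: Int, featureIndex: Int, threshold: Double,
                      left: Option[SimpleNode], right: Option[SimpleNode]) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// Route an example from the root down to its current leaf, instead of
// scanning every node at a level and tracing each one back up to the root.
@annotation.tailrec
def predictNodeIndex(node: SimpleNode, features: Array[Double]): Int =
  if (node.isLeaf) node.id
  else if (features(node.featureIndex) <= node.threshold)
    predictNodeIndex(node.left.get, features)
  else
    predictNodeIndex(node.right.get, features)
~~~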

Major changes:
* Eliminated Filter class, findBinsForLevel() method.
* Set up node parent links in main loop over levels in train().
* Added predictNodeIndex() for filtering top-down.
* Added DTMetadata class

Other changes:
* Pre-compute set of unorderedFeatures.

Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]:
* The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy).

I've done initial tests indicating that this speeds things up, but am only now running large-scale ones.

CC: mengxr manishamde chouqin  Any comments are welcome---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:

a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-16 23:53:14 -07:00
Xiangrui Meng fbad72288d [SPARK-3077][MLLIB] fix some chisq-test
- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test should have a null hypothesis
- correct null hypothesis statement for independence test
- p-value: 0.01 -> 0.1

Author: Xiangrui Meng <meng@databricks.com>

Closes #1982 from mengxr/fix-chisq and squashes the following commits:

5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
bc74ea1 [Xiangrui Meng] update chisq-test
2014-08-16 21:16:27 -07:00
Xiangrui Meng ac6411c6e7 [SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
The name `RandomRDDGenerators` suggests a factory for `RandomRDDGenerator`. However, its methods return RDDs, not RDD generators. So a more proper (and shorter) name would be `RandomRDDs`.

dorx brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #1979 from mengxr/randomrdds and squashes the following commits:

b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
2014-08-16 15:14:43 -07:00
Xiangrui Meng 7e70708a99 [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.

`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.
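
For example:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// parse is the inverse of toString and mirrors Vectors.parse:
val p = LabeledPoint.parse("(1.0,[1.0,0.0,3.0])")
assert(p == LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)))
// Sparse vectors use the same text format as Vectors.parse:
val q = LabeledPoint.parse("(0.0,(3,[1],[2.0]))")
~~~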

freeman-lab tdas

Author: Xiangrui Meng <meng@databricks.com>

Closes #1952 from mengxr/labelparser and squashes the following commits:

c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
2014-08-16 15:13:34 -07:00
Xiangrui Meng 2e069ca656 [SPARK-3001][MLLIB] Improve Spearman's correlation
The current implementation requires sorting individual columns, which could be done with a global sort.

result on a 32-node cluster:

m | n | prev | this
---|---|-------|-----
1000000 | 50 | 55s | 9s
10000000 | 50 | 97s | 76s
1000000 | 100  | 119s | 15s

Author: Xiangrui Meng <meng@databricks.com>

Closes #1917 from mengxr/spearman and squashes the following commits:

4d5d262 [Xiangrui Meng] remove unused import
85c48de [Xiangrui Meng] minor updates
a048d0c [Xiangrui Meng] remove cache and set a limit to cachedIds
b98bb18 [Xiangrui Meng] add comments
0846e07 [Xiangrui Meng] first version
2014-08-15 21:07:55 -07:00
Xiangrui Meng 5d25c0b74f [SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others
Should ask users to set parameters through the optimizer. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1973 from mengxr/lr-lbfgs and squashes the following commits:

e3efbb1 [Xiangrui Meng] fix tests
21b3579 [Xiangrui Meng] fix method name
641eea4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lr-lbfgs
456ab7c [Xiangrui Meng] update LRWithLBFGS
2014-08-15 21:04:29 -07:00
Joseph K. Bradley c7032290a3 [SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by feature values, rather than by subsets of values/categories.
* exhibited for unordered features (multi-class classification with categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:

5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-15 14:50:10 -07:00
DB Tsai 9622106757 [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number
In theory, the scale of your inputs are irrelevant to logistic regression.
You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
adjust accordingly. It will be 1E-6 times smaller than the original β1, due
to the invariance property of MLEs.

However, during the optimization process, the convergence (rate)
depends on the condition number of the training dataset. Scaling
the variables often reduces this condition number, thus improving
the convergence rate.

Without reducing the condition number, some training datasets
mixing the columns with different scales may not be able to converge.

GLMNET and LIBSVM packages perform the scaling to reduce
the condition number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we standardize the training
features by dividing each column by its standard deviation (without
subtracting the mean, which would densify sparse vectors), and train the
model in the scaled space. Then we transform the coefficients from the
scaled space back to the original scale, as GLMNET and LIBSVM do.

Currently, it's only enabled in LogisticRegressionWithLBFGS.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
2014-08-14 11:56:13 -07:00
Xiangrui Meng 69a57a18ee [SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel
As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets.

Another option for the method name is `setIntermediateRDDStorageLevel`.
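
A hedged usage sketch (rank and iteration values are placeholders):

~~~
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

// Trade speed for space on large datasets: persist the intermediate
// in/out-link RDDs serialized with disk spillover; combine with
// spark.rdd.compress=true for further savings.
val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)
~~~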

Author: Xiangrui Meng <meng@databricks.com>

Closes #1913 from mengxr/als-storagelevel and squashes the following commits:

d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel
7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
2014-08-13 23:53:44 -07:00
Xiangrui Meng 7ecb867c4c [MLLIB] use Iterator.fill instead of Array.fill
Iterator.fill uses less memory because it produces values lazily instead of materializing the whole array.
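
For illustration:

~~~
import scala.util.Random

// Array.fill materializes all n values up front; Iterator.fill produces
// them lazily, one at a time, so only one value is live at once.
val n = 1000000
val eager = Array.fill(n)(Random.nextGaussian())     // O(n) memory
val lazyIt = Iterator.fill(n)(Random.nextGaussian()) // O(1) memory
~~~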

Author: Xiangrui Meng <meng@databricks.com>

Closes #1930 from mengxr/rand-gen-iter and squashes the following commits:

24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill
2014-08-13 16:20:49 -07:00
Doris Xin fe4735958e [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics
For both Scala and Python.

The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside `MultivariateStatisticalSummarySerialized`, which is itself referenced inside a method of `PythonMLLibAPI`.

`MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.
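
A hedged usage sketch of the Scala side (assuming an active SparkContext `sc`; the sample values are placeholders):

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

val summary = Statistics.colStats(observations)
println(summary.mean)     // column-wise means
println(summary.variance) // column-wise variances
~~~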

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1911 from dorx/colStats and squashes the following commits:

77b9924 [Doris Xin] developerAPI tag
de9cbbe [Doris Xin] reviewer comments and moved more ser/de
459faba [Doris Xin] colStats in Statistics for both Scala and Python
2014-08-12 23:47:42 -07:00
Liquan Pei f0060b75ff [MLlib] Correctly set vectorSize and alpha
mengxr
Correctly set vectorSize and alpha in Word2Vec training.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits:

85f64f2 [Liquan Pei] correctly set vectorSize and alpha
2014-08-12 00:28:00 -07:00
Xiangrui Meng 9038d94e1e [SPARK-2923][MLLIB] Implement some basic BLAS routines
Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve performance.

Tested on my local machine:

~~~
bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
--regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
~~~

1. before: ~1m
2. after: ~30s
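
The actual `BLAS` object is private[mllib]; a minimal sketch of the kind of dense kernels involved (the real routines also handle MLlib's sparse vectors):

~~~
// y := a * x + y (axpy) and dot(x, y) over dense arrays.
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  var i = 0
  while (i < x.length) { y(i) += a * x(i); i += 1 }
}

def dot(x: Array[Double], y: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.length) { s += x(i) * y(i); i += 1 }
  s
}
~~~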

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #1849 from mengxr/ml-blas and squashes the following commits:

ba583a2 [Xiangrui Meng] exclude Vector.copy
a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
6edeab9 [Xiangrui Meng] address comments
940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
c2a38bc [Xiangrui Meng] enhance dot tests
4cfaac4 [Xiangrui Meng] add apache header
48d01d2 [Xiangrui Meng] add tests for zeros and copy
3b882b1 [Xiangrui Meng] use blas.scal in gradient
735eb23 [Xiangrui Meng] remove d from BLAS routines
d2d7d3c [Xiangrui Meng] update gradient and lbfgs
7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
14e6645 [Xiangrui Meng] add ddot
cbb8273 [Xiangrui Meng] add daxpy test
07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
e8c326d [Xiangrui Meng] axpy
2014-08-11 22:33:45 -07:00
DB Tsai 6fab941b65 [SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface
for training with the LBFGS optimizer, which converges faster than SGD.
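
A hedged usage sketch (parameter values are placeholders):

~~~
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainLoR(data: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithLBFGS()
  lr.optimizer
    .setRegParam(0.01)
    .setConvergenceTol(1e-4)
  lr.run(data) // returns a LogisticRegressionModel
}
~~~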

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1862 from dbtsai/dbtsai-lbfgs-lor and squashes the following commits:

aa84b81 [DB Tsai] small change
f852bcd [DB Tsai] Remove duplicate method
f119fdc [DB Tsai] Formatting
97776aa [DB Tsai] address more feedback
85b4a91 [DB Tsai] address feedback
3cf50c2 [DB Tsai] LogisticRegressionWithLBFGS interface
2014-08-11 19:49:29 -07:00
Doris Xin 32638b5e74 [SPARK-2515][mllib] Chi Squared test
Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1733 from dorx/chisquare and squashes the following commits:

cafb3a7 [Doris Xin] fixed p-value for extreme case.
d286783 [Doris Xin] Merge branch 'master' into chisquare
e95e485 [Doris Xin] reviewer comments.
7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class
80d03e2 [Doris Xin] Reviewer comments.
c39eeb5 [Doris Xin] units passed with updated API
e90d90a [Doris Xin] Merge branch 'master' into chisquare
7eea80b [Doris Xin] WIP
d64c2fb [Doris Xin] Merge branch 'master' into chisquare
5686082 [Doris Xin] facelift
bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring
50703a5 [Doris Xin] merge master
4e4e361 [Doris Xin] WIP
e6b83f3 [Doris Xin] reviewer comments
3d61582 [Doris Xin] input names
706d436 [Doris Xin] Added API for RDD[Vector]
6598379 [Doris Xin] API and code structure.
ff17423 [Doris Xin] WIP
2014-08-11 19:22:14 -07:00
Xiangrui Meng 74d6f62264 [SPARK-1997][MLLIB] update breeze to 0.9
0.9 dependencies (this version doesn't depend on scalalogging and I excluded commons-math3 from its transitive dependencies):
~~~
+-org.scalanlp:breeze_2.10:0.9 [S]
  +-com.github.fommil.netlib:core:1.1.2
  +-com.github.rwl:jtransforms:2.4.0
  +-net.sf.opencsv:opencsv:2.3
  +-net.sourceforge.f2j:arpack_combined_all:0.1
  +-org.scalanlp:breeze-macros_2.10:0.3.1 [S]
  | +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
  |
  +-org.slf4j:slf4j-api:1.7.5
  +-org.spire-math:spire_2.10:0.7.4 [S]
    +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
    |
    +-org.spire-math:spire-macros_2.10:0.7.4 [S]
      +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
~~~

Closes #1749

CC: witgo avati

Author: Xiangrui Meng <meng@databricks.com>

Closes #1857 from mengxr/breeze-0.9 and squashes the following commits:

7fc16b6 [Xiangrui Meng] don't know why but exclude a private method for mima
dcc502e [Xiangrui Meng] update breeze to 0.9
2014-08-08 15:07:31 -07:00
Xiangrui Meng b9e9e53773 [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms
This is part of SPARK-2828:

1. separate IDF model from IDF algorithm (which generates a model)
2. separate StandardScaler model from StandardScaler
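
A hedged sketch of the resulting split, using IDF as the example:

~~~
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// The algorithm's fit() produces a model; the model does the transforming.
def reweight(termFrequencies: RDD[Vector]): RDD[Vector] = {
  val idfModel = new IDF().fit(termFrequencies)
  idfModel.transform(termFrequencies)
}
~~~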

CC: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1814 from mengxr/feature-api-update and squashes the following commits:

40d863b [Xiangrui Meng] move mean and variance to model
48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm
89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm
2014-08-07 11:28:12 -07:00
Joseph K. Bradley 8d1dec4fa4 [mllib] DecisionTree Strategy parameter checks
Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters.
CC mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1821 from jkbradley/dt-robustness and squashes the following commits:

4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness
7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters
2014-08-07 00:20:38 -07:00
Joseph K. Bradley 47ccd5e71b [SPARK-2851] [mllib] DecisionTree Python consistency update
Added 6 static train methods to match the Python API, but without default arguments (the Python default args are noted in the docs).

Added factory classes for Algo and Impurity, but made private[mllib].

CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:

6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
fe6dbfa [Joseph K. Bradley] removed unnecessary imports
e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
c699850 [Joseph K. Bradley] a few doc comments
eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
2014-08-06 22:58:59 -07:00
Xiangrui Meng 25cff1019d [SPARK-2852][MLLIB] API consistency for mllib.feature
This is part of SPARK-2828:

1. added a Java-friendly fit method to Word2Vec with tests
2. change DeveloperApi to Experimental for Normalizer & StandardScaler
3. change default feature dimension to 2^20 in HashingTF
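
For item 3, a small hedged sketch:

~~~
import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF()             // default numFeatures is now 1 << 20
val smaller = new HashingTF(1 << 18) // or pick a custom dimension
val vec = tf.transform(Seq("a", "b", "a")) // sparse term-frequency vector
~~~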

Author: Xiangrui Meng <meng@databricks.com>

Closes #1807 from mengxr/feature-api-check and squashes the following commits:

773c1a9 [Xiangrui Meng] change default numFeatures to 2^20 in HashingTF change annotation from DeveloperApi to Experimental in Normalizer and StandardScaler
883e122 [Xiangrui Meng] add @Experimental to Word2VecModel add a Java-friendly method to Word2Vec.fit with tests
2014-08-06 14:07:51 -07:00
DB Tsai c7b52010df [MLlib] Use this.type as return type in k-means' builder pattern
to ensure that the returned object is the instance itself.
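
A minimal sketch of why `this.type` matters for builders (the class names here are hypothetical):

~~~
class Builder {
  protected var k: Int = 2
  def setK(k: Int): this.type = { this.k = k; this }
}

class SeededBuilder extends Builder {
  private var seed: Long = 42L
  def setSeed(s: Long): this.type = { seed = s; this }
}

// setK returns the concrete subtype, so the chain below compiles; had setK
// returned Builder, .setSeed would not be found.
val b: SeededBuilder = new SeededBuilder().setK(5).setSeed(7L)
~~~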

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1796 from dbtsai/dbtsai-kmeans and squashes the following commits:

658989e [DB Tsai] Alpine Data Labs
2014-08-05 23:32:29 -07:00
Michael Giannakopoulos 1aad9114c9 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods
Related to Jira Issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC)

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1775 from miccagiann/linearMethodsReg and squashes the following commits:

cb774c3 [Michael Giannakopoulos] MiniBatchFraction added in related PythonMLLibAPI java stubs.
81fcbc6 [Michael Giannakopoulos] Fixing a typo-error.
8ad263e [Michael Giannakopoulos] Adding regularizer type and intercept parameters to LogisticRegressionWithSGD and SVMWithSGD.
2014-08-05 16:30:32 -07:00
Xiangrui Meng cc491f69cd [SPARK-2864][MLLIB] fix random seed in word2vec; move model to local
It also moves the model to local in order to map `RDD[String]` to `RDD[Vector]`.

Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #1790 from mengxr/word2vec-fix and squashes the following commits:

a87146c [Xiangrui Meng] add setters and make a default constructor
e5c923b [Xiangrui Meng] fix random seed in word2vec; move model to local
2014-08-05 16:22:41 -07:00
Liquan Pei e053c55819 [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:

taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876

The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932
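
A hedged sketch of how such queries are issued with the new API (the corpus is an assumed RDD of tokenized sentences):

~~~
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

def nearestWords(corpus: RDD[Seq[String]]): Unit = {
  val model = new Word2Vec().fit(corpus)
  model.findSynonyms("china", 20).foreach { case (word, similarity) =>
    println(s"$word $similarity")
  }
}
~~~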

Author: Liquan Pei <lpei@gopivotal.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Liquan Pei <liquanpei@gmail.com>

Closes #1719 from Ishiihara/master and squashes the following commits:

2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit
2014-08-03 23:55:58 -07:00
DB Tsai ae58aea2d1 SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.

There are two implementations of `VectorTransformer` now, and they all can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

2) `Normalizer` - Normalizes samples individually to unit L^n norm
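
A hedged usage sketch of the second transformer:

~~~
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

// Scale a single sample to unit L^2 norm.
val normalizer = new Normalizer(2.0)
val unit = normalizer.transform(Vectors.dense(3.0, 4.0))
// unit == [0.6, 0.8]
~~~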

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs
2014-08-03 21:39:21 -07:00
Joseph K. Bradley 2998e38a94 [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.

Other improvements to Decision Trees for ease of use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
--> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1740 from jkbradley/dt-java-new and squashes the following commits:

0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-03 10:36:52 -07:00
Joseph K. Bradley 3f67382e7c [SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.

API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()

Examples and testing:
* Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors

Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:

3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
2014-08-02 13:07:17 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the slf4j api directly"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the slf4j api directly
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the slf4j api directly.
2014-08-01 23:55:11 -07:00
Burak fda475987f [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator. RandomRDD is now of generic type
RandomRDDGenerators used to output only RDD[Double].
Now RandomRDDGenerators.randomRDD can generate a random RDD[T] from any class that extends RandomDataGenerator, by supplying a type T and overriding the nextValue() function.
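
A hedged sketch of such a class (the string generator here is hypothetical):

~~~
import org.apache.spark.mllib.random.RandomDataGenerator
import scala.util.Random

// Custom generator producing random alphanumeric strings; usable with
// RandomRDDGenerators.randomRDD to build an RDD[String].
class RandomStringGenerator(len: Int) extends RandomDataGenerator[String] {
  private val rng = new Random()
  override def nextValue(): String = rng.alphanumeric.take(len).mkString
  override def setSeed(seed: Long): Unit = rng.setSeed(seed)
  override def copy(): RandomStringGenerator = new RandomStringGenerator(len)
}
~~~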

Author: Burak <brkyvz@gmail.com>

Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:

c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
2014-08-01 22:32:12 -07:00
Tor Myklebust e25ec06171 [SPARK-1580][MLLIB] Estimate ALS communication and computation costs.
Continue the work from #493.

Closes #493 and Closes #593

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:

9b56a8b [Xiangrui Meng] updated API and added a simple test
68a3229 [Xiangrui Meng] merge master
217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
8cbb718 [Tor Myklebust] Braces get spaces.
0455cd4 [Tor Myklebust] Parens for collectAsMap.
2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
8b21e6d [Tor Myklebust] Fix overlong lines.
8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
6c31324 [Tor Myklebust] Make it actually build...
a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-08-01 21:25:02 -07:00
Michael Giannakopoulos c281189222 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods.
Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regularizer functionality for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercept parameters for linear regression method.
2014-08-01 21:00:31 -07:00
Jeremy Freeman f6a1899306 Streaming mllib [SPARK-2438][MLLIB]
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_
- An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_
- Unit tests

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
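
A hedged usage sketch (numFeatures and the two streams are assumptions of this sketch):

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

def fitOnline(numFeatures: Int,
              trainingStream: DStream[LabeledPoint],
              testStream: DStream[LabeledPoint]): Unit = {
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(trainingStream)                       // update model per batch
  model.predictOn(testStream.map(_.features)).print() // predict on other data
}
~~~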

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming
2014-08-01 20:10:26 -07:00
Joseph K. Bradley 7058a5393b [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features
Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.

Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"

Bug fix: Modified upper bound discussed above.

Also: Small improvements to coding style in DecisionTree.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:

225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-01 15:52:21 -07:00
Doris Xin d88e695613 [SPARK-2786][mllib] Python correlations
Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1713 from dorx/pythonCorrelation and squashes the following commits:

5f1e60c [Doris Xin] reviewer comments.
46ff6eb [Doris Xin] reviewer comments.
ad44085 [Doris Xin] style fix
e69d446 [Doris Xin] fixed missed conflicts.
eb5bf56 [Doris Xin] merge master
cc9f725 [Doris Xin] units passed.
9141a63 [Doris Xin] WIP2
d199f1f [Doris Xin] Moved correlation names into a public object
cd163d6 [Doris Xin] WIP
2014-08-01 15:02:17 -07:00
Xiangrui Meng 5328c0aaa0 [HOTFIX] downgrade breeze version to 0.7
breeze-0.8.1 causes dependency issues, as discussed in #940 .

Author: Xiangrui Meng <meng@databricks.com>

Closes #1718 from mengxr/revert-breeze and squashes the following commits:

99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
2014-08-01 10:00:46 -07:00
witgo 0dacb1adb5 [SPARK-1997] update breeze to version 0.8.1
`breeze 0.8.1` depends on `scala-logging-slf4j 2.1.1`. The relevant code is in #1369.

Author: witgo <witgo@qq.com>

Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:

65cc65e [witgo] update breeze  to version 0.8.1
2014-08-01 07:47:44 -07:00
Sean Owen 82d209d43f SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel
Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method that computes the top N scoring products for a user (or vice versa – users for a product).

(This also corrects some long lines in the Java ALS test suite.)

As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? with a pointer about where that should go, I could add that.
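
A hedged sketch of the new methods (model is an assumed trained MatrixFactorizationModel):

~~~
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def showTopN(model: MatrixFactorizationModel): Unit = {
  val topProducts = model.recommendProducts(1, 10) // top 10 products for user 1
  val topUsers = model.recommendUsers(1, 10)       // top 10 users for product 1
  topProducts.foreach(r => println(s"${r.product} ${r.rating}"))
  topUsers.foreach(r => println(s"${r.user} ${r.rating}"))
}
~~~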

Author: Sean Owen <srowen@gmail.com>

Closes #1687 from srowen/SPARK-2768 and squashes the following commits:

b349675 [Sean Owen] Additional review changes
c9edb04 [Sean Owen] Updates from code review
7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
2014-08-01 07:32:53 -07:00
Doris Xin c4755403e7 [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation
Before this patch, getRanks computed the wrong rank when numPartitions >= size of the input RDDs. Added unit tests to cover this bug.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1710 from dorx/correlationBug and squashes the following commits:

733def4 [Doris Xin] bugs and reviewer comments.
31db920 [Doris Xin] revert unnecessary change
043ff83 [Doris Xin] bug fix for spearman corner case
2014-07-31 21:23:35 -07:00
Xiangrui Meng b19008320b [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK
Now the factors are persisted in memory only. If they get evicted by later jobs, we might have to restart the computation from the very beginning. A better solution is changing the storage level to `MEMORY_AND_DISK`.

srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #1700 from mengxr/als-level and squashes the following commits:

c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
2014-07-31 21:14:08 -07:00
Joseph K. Bradley b124de584a [SPARK-2756] [mllib] Decision tree bug fixes
(1) Inconsistent aggregate (agg) indexing for unordered features.
(2) Fixed gain calculations for edge cases.
(3) One-off error in choosing thresholds for continuous features for small datasets.
(4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)

Other updates, to help with tests:
* Updated DecisionTreeRunner to print more info.
* Added utility functions to DecisionTreeModel: toString, depth, numNodes
* Improved internal DecisionTree documentation

Bug fix details:

(1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

(2) Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left child had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.

Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.

(3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.

* Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
* Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
* Fix: The threshold is set to be the average of 2 consecutive (sorted) examples' feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using the average of examples i and i+1 (see the sketch after this list).
* Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
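
An illustrative sketch of the midpoint rule:

~~~
// Candidate thresholds are midpoints between consecutive sorted feature
// values rather than the values themselves.
val sortedValues = Array(1.0, 2.0, 5.0)
val thresholds = sortedValues.sliding(2).map(p => (p(0) + p(1)) / 2.0).toArray
// thresholds == Array(1.5, 3.5)
~~~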

CC: mengxr manishamde  Please let me know if I missed something!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:

2b20c61 [Joseph K. Bradley] Small doc and style updates
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
2014-07-31 20:51:48 -07:00
Doris Xin d8430148ee [SPARK-2724] Python version of RandomRDDGenerators
RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP
2014-07-31 20:32:57 -07:00
Xiangrui Meng dc0865bc7e [SPARK-2511][MLLIB] add HashingTF and IDF
This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .

Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
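
A hedged end-to-end sketch over a corpus of tokenized documents:

~~~
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def tfIdf(docs: RDD[Seq[String]]): RDD[Vector] = {
  val tf = new HashingTF().transform(docs)
  tf.cache() // IDF.fit makes an extra pass over the term frequencies
  new IDF().fit(tf).transform(tf)
}
~~~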

Author: Xiangrui Meng <meng@databricks.com>

Closes #1671 from mengxr/tfidf and squashes the following commits:

7d65888 [Xiangrui Meng] use JavaConverters._
5fe9ec4 [Xiangrui Meng] fix unit test
6e214ec [Xiangrui Meng] add apache header
cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
3814440 [Xiangrui Meng] add HashingTF and IDF
2014-07-31 12:55:00 -07:00
Sean Owen e9b275b769 SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets
Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.

Author: Sean Owen <srowen@gmail.com>

Closes #1663 from srowen/SPARK-2341 and squashes the following commits:

8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
18a8c8e [Sean Owen] Updates from review
83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
2014-07-30 17:34:32 -07:00
Sean Owen 6ab96a6fd0 SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
The Maven-based builds in the build matrix have been failing for a few days:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/

On inspection, it looks like the Spark SQL Java tests don't compile:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull

I confirmed it by repeating the command vs master:

`mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`

The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`

Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).

Author: Sean Owen <srowen@gmail.com>

Closes #1660 from srowen/SPARK-2749 and squashes the following commits:

858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
2014-07-30 15:04:33 -07:00
GuoQiang Li fc47bb6967 [SPARK-2544][MLLIB] Improve ALS algorithm resource usage
Author: GuoQiang Li <witgo@qq.com>
Author: witgo <witgo@qq.com>

Closes #929 from witgo/improve_als and squashes the following commits:

ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
154dccf [GuoQiang Li] checkpoint products only
c5779ff [witgo] Improve ALS algorithm resource usage
2014-07-30 11:00:11 -07:00
Sean Owen ee07541e99 SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log
In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists.

Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.

While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible.
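
For illustration:

~~~
val p = 1e-18
math.log(1.0 + p) // 0.0, because 1.0 + p == 1.0 in double precision
math.log1p(p)     // ~1.0e-18, the correct answer
math.expm1(p)     // ~1.0e-18, versus math.exp(p) - 1.0 == 0.0
~~~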

Also note the related PR for Python: https://github.com/apache/spark/pull/1652

Author: Sean Owen <srowen@gmail.com>

Closes #1659 from srowen/SPARK-2748 and squashes the following commits:

c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
2014-07-30 08:55:15 -07:00
Xiangrui Meng 20424dad30 [SPARK-2174][MLLIB] treeReduce and treeAggregate
In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some overhead in communication, because the driver still needs to coordinate the data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.

I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on a 32-node m3.2xlarge cluster.

code:

~~~
import breeze.linalg._
import org.apache.log4j._

Logger.getRootLogger.setLevel(Level.OFF)

for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
  val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
  var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
  start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
}
~~~

out:

| n | treeReduce(_ + _, 2) | reduce |
|---|---------------------|-----------|
| 10 | 0.215538731 | 0.204206899 |
| 100 | 0.278405907 | 0.205732582 |
| 1000 | 0.208972182 | 0.214298272 |
| 10000 | 0.194792071 | 0.349353687 |
| 100000 | 0.347683285 | 6.086671892 |
| 1000000 | 2.589350682 | 66.572906702 |

CC: @pwendell

This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1110 from mengxr/tree and squashes the following commits:

c6cd267 [Xiangrui Meng] make depth default to 2
b04b96a [Xiangrui Meng] address comments
9bcc5d3 [Xiangrui Meng] add depth for readability
7495681 [Xiangrui Meng] fix compile error
142a857 [Xiangrui Meng] merge master
d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
be6a88a [Xiangrui Meng] use treeAggregate in mllib
0f94490 [Xiangrui Meng] add docs
eb71c33 [Xiangrui Meng] add treeReduce
fe42a5e [Xiangrui Meng] add treeAggregate
2014-07-29 01:16:41 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every pull request that doesn't touch SQL code will also execute the test suites defined in `hive-thriftserver`, and the tests fail because the related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
DB Tsai 255b56f9f5 [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests
Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.

Based on discussion in the community, we have implemented two different APIs: one for relative tolerance and one for absolute tolerance. It makes sense that test writers should know which one they need depending on their circumstances.

Developers also need to explicitly specify the eps; there is no default value, since a default would sometimes cause confusion.

When comparing against zero using relative tolerance, an exception will be raised to warn users that it's meaningless.

For relative tolerance, users can now write

    assert(23.1 ~== 23.52 relTol 0.02)
    assert(23.1 ~== 22.74 relTol 0.02)
    assert(23.1 ~= 23.52 relTol 0.02)
    assert(23.1 ~= 22.74 relTol 0.02)
    assert(!(23.1 !~= 23.52 relTol 0.02))
    assert(!(23.1 !~= 22.74 relTol 0.02))

    // This will throw exception with the following message.
    // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
    assert(23.1 !~== 23.52 relTol 0.02)

    // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
    assert(23.1 ~== 22.34 relTol 0.02)

For absolute error,

    assert(17.8 ~== 17.99 absTol 0.2)
    assert(17.8 ~== 17.61 absTol 0.2)
    assert(17.8 ~= 17.99 absTol 0.2)
    assert(17.8 ~= 17.61 absTol 0.2)
    assert(!(17.8 !~= 17.99 absTol 0.2))
    assert(!(17.8 !~= 17.61 absTol 0.2))

    // This will throw exception with the following message.
    // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
    assert(17.8 !~== 17.99 absTol 0.2)

    // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
    assert(17.8 ~== 17.59 absTol 0.2)

Authors:
  DB Tsai <dbtsai@alpinenow.com>
  Marek Kolodziej <marek@alpinenow.com>

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:

8c7cbcc [DB Tsai] Alpine Data Labs
2014-07-28 11:34:19 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Doris Xin 81fcdd22c8 [SPARK-2514] [mllib] Random RDD generator
Utilities for generating random RDDs.

RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range` objects in Scala can only have `size <= Int.MaxValue`.

The object `RandomRDDGenerators` can be transformed into a generator class to reduce the number of auxiliary methods for optional arguments.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1520 from dorx/randomRDD and squashes the following commits:

01121ac [Doris Xin] reviewer comments
6bf27d8 [Doris Xin] Merge branch 'master' into randomRDD
a8ea92d [Doris Xin] Reviewer comments
063ea0b [Doris Xin] Merge branch 'master' into randomRDD
aec68eb [Doris Xin] newline
bc90234 [Doris Xin] units passed.
d56cacb [Doris Xin] impl with RandomRDD
92d6f1c [Doris Xin] solution for Cloneable
df5bcff [Doris Xin] Merge branch 'generator' into randomRDD
f46d928 [Doris Xin] WIP
49ed20d [Doris Xin] alternative poisson distribution generator
7cb0e40 [Doris Xin] fix for data inconsistency
8881444 [Doris Xin] RandomRDDGenerator: initial design
2014-07-27 16:16:39 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Doris Xin 3a69c72e5c [SPARK-2679] [MLLib] Ser/De for Double
Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1581 from dorx/doubleSerDe and squashes the following commits:

86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
2bfe7a4 [Doris Xin] Removed magic byte
ad4d0d9 [Doris Xin] removed a space in unit
a9020bc [Doris Xin] units passed
7dad9af [Doris Xin] WIP
2014-07-27 07:21:07 -07:00
Xiangrui Meng aaf2b735fd [SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure
We saw task serialization problems with large feature dimensions, which can be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.
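
A hedged sketch of the pattern:

~~~
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Broadcast the (potentially large) weight vector once per job instead of
// serializing it into every task closure.
def predict(sc: SparkContext, data: RDD[Array[Double]],
            weights: Array[Double]): RDD[Double] = {
  val bcWeights = sc.broadcast(weights)
  data.map { x =>
    var s = 0.0
    var i = 0
    while (i < x.length) { s += x(i) * bcWeights.value(i); i += 1 }
    s
  }
}
~~~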

Author: Xiangrui Meng <meng@databricks.com>

Closes #1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
2014-07-26 22:56:07 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after it passes tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks to chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug since it involves more subtle considerations and is worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Matei Zaharia 8529ced35c SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup
JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead when they hold only a few values, which is likely to happen in cases such as join. In particular, each one holds a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 bytes for the ArrayBuffer data structure itself. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

There are some changes throughout the code to deal with CoGroupedRDD returning an Array instead. We could also decide not to do that, but CoGroupedRDD is a `DeveloperAPI`, so I think it's okay to change it here.
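
A simplified sketch of the CompactBuffer idea (a hypothetical `MiniCompactBuffer`, not the actual Spark class): the first two elements live in plain fields, so groups of one or two values allocate no array at all.

```scala
class MiniCompactBuffer[T] {
  private var e0: T = _                  // first element, inlined as a field
  private var e1: T = _                  // second element, inlined as a field
  private var rest: Array[AnyRef] = null // spill storage for elements 3+
  private var curSize = 0

  def +=(t: T): this.type = {
    curSize match {
      case 0 => e0 = t
      case 1 => e1 = t
      case n =>
        if (rest == null) rest = new Array[AnyRef](8) // small initial spill array
        else if (n - 2 == rest.length) rest = java.util.Arrays.copyOf(rest, rest.length * 2)
        rest(n - 2) = t.asInstanceOf[AnyRef]
    }
    curSize += 1
    this
  }

  def apply(i: Int): T =
    if (i == 0) e0 else if (i == 1) e1 else rest(i - 2).asInstanceOf[T]

  def length: Int = curSize
}
```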

Author: Matei Zaharia <matei@databricks.com>

Closes #1555 from mateiz/compact-groupby and squashes the following commits:

845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
07621a7 [Matei Zaharia] Review comments
0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
f61f040 [Matei Zaharia] Fix line lengths
59da88b0 [Matei Zaharia] Fix line lengths
197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
2014-07-25 00:32:32 -07:00
Xiangrui Meng c960b50518 [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests
Allow small errors in comparison.
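
One hedged way to express such a tolerant check (an illustration of the idea, not the code in the PR):

```scala
// True when a and b agree within eps, absolutely or relatively.
def almostEqual(a: Double, b: Double, eps: Double = 1e-6): Boolean =
  math.abs(a - b) <= eps * math.max(1.0, math.max(math.abs(a), math.abs(b)))
```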

@dbtsai, this unit test blocks https://github.com/apache/spark/pull/1562, so I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests
2014-07-24 12:37:02 -07:00
Xiangrui Meng 4c7243e109 [SPARK-2617] Correct doc and usages of preservesPartitioning
The name `preservesPartitioning` is ambiguous: it could mean 1) preserving the indices of the partitions, or 2) preserving the partitioner. The latter is the correct meaning, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `mapPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib (see the sketch after this list).
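
A hedged sketch of point 1 in practice, assuming a pair RDD with a hash partitioner: `preservesPartitioning = true` is a promise that the keys are unchanged, so the partitioner can be kept.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x

def halveValues(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq((1, 2.0), (2, 4.0), (3, 6.0)))
    .partitionBy(new HashPartitioner(4))
  // Keys are untouched, so advertising preservesPartitioning = true is valid
  // and downstream joins/reduces on the same keys can skip a shuffle.
  val halved = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v / 2.0) },
    preservesPartitioning = true)
  assert(halved.partitioner == pairs.partitioner)
}
```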

Author: Xiangrui Meng <meng@databricks.com>

Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
2014-07-23 00:58:55 -07:00
peng.zhang 75db1742ab [SPARK-2612] [mllib] Fix data skew in ALS
Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1521 from renozhang/fix-als and squashes the following commits:

b5727a4 [peng.zhang] Remove no need argument
1a4f7a0 [peng.zhang] Fix data skew in ALS
2014-07-22 02:39:07 -07:00
Xiangrui Meng 1b10b8114a [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors
This is part of SPARK-2495, to allow users to construct linear models manually.
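
A hedged sketch of what the public constructors enable: assembling a model from externally estimated weights without running a trainer (the weights and intercept below are made-up numbers for illustration).

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

val model = new LogisticRegressionModel(Vectors.dense(0.5, -1.2, 3.0), 0.1)
val prediction = model.predict(Vectors.dense(1.0, 0.0, 2.0)) // 0.0 or 1.0
```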

Author: Xiangrui Meng <meng@databricks.com>

Closes #1492 from mengxr/public-constructor and squashes the following commits:

a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
2014-07-20 13:04:59 -07:00
Doris Xin a243364b22 [SPARK-2359][MLlib] Correlations
Implementation of Pearson's and Spearman's correlation.
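
A hedged usage sketch of the resulting API, with method names as they appear in MLlib's `Statistics` object:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.Statistics

def corrExample(sc: SparkContext): Unit = {
  val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
  val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.0))
  val pearson = Statistics.corr(x, y)               // Pearson is the default
  val spearman = Statistics.corr(x, y, "spearman")  // rank-based correlation
  println(s"pearson=$pearson spearman=$spearman")
}
```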

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1367 from dorx/correlation and squashes the following commits:

c0dd7dc [Doris Xin] here we go
32d83a3 [Doris Xin] Reviewer comments
4db0da1 [Doris Xin] added private[stat] to Spearman
b716f70 [Doris Xin] minor fixes
6e1b42a [Doris Xin] More comments addressed. Still some open questions
8104f44 [Doris Xin] addressed comments. some open questions still
39387c2 [Doris Xin] added missing header
bd3cf19 [Doris Xin] Merge branch 'master' into correlation
6341884 [Doris Xin] race condition bug squished
bd2bacf [Doris Xin] Race condition bug
b775ff9 [Doris Xin] old wrong impl
534ebf2 [Doris Xin] Merge branch 'master' into correlation
818fa31 [Doris Xin] wip units
9d808ee [Doris Xin] wip units
b843a13 [Doris Xin] revert change in stat counter
28561b6 [Doris Xin] wip
bb2e977 [Doris Xin] minor fix
8e02c63 [Doris Xin] Merge branch 'master' into correlation
2a40aa1 [Doris Xin] initial, untested implementation of Pearson
dfc4854 [Doris Xin] WIP
2014-07-18 17:25:32 -07:00
Manish Amde d88f6be446 [MLlib] SPARK-1536: multiclass classification support for decision tree
The ability to perform multiclass classification is a big advantage of decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample-weight support, using a WeightedLabeledPoint class, for handling unbalanced datasets during classification. This will also support algorithms such as AdaBoost, which require instances to be weighted.

It handles the special case where the categorical variables cannot be ordered for multiclass classification, so the optimizations used to speed up binary classification cannot be applied directly. More specifically, for m categories in a categorical feature, it analyzes all `2^(m-1) - 1` categorical splits, provided the number of splits is less than the maxBins given in the input. This condition will not be met for features with a large number of categories; using decision trees is not recommended for such datasets in general, since categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithm, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints.

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.
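
A quick check of the split count quoted above: a categorical feature with m unordered categories has `2^m - 2` nonempty proper subsets, and halving for complement symmetry leaves `2^(m-1) - 1` candidate binary splits.

```scala
// Number of candidate splits for an unordered categorical feature.
def numUnorderedSplits(m: Int): Int = (1 << (m - 1)) - 1

assert(numUnorderedSplits(4) == 7)    // small m stays cheap
assert(numUnorderedSplits(10) == 511) // large m can easily exceed maxBins
```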

cc: mengxr, etrain, hirakendu, atalwalkar, srowen

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #886 from manishamde/multiclass and squashes the following commits:

26f8acc [Manish Amde] another attempt at fixing mima
c5b2d04 [Manish Amde] more MIMA fixes
1ce7212 [Manish Amde] change problem filter for mima
10fdd82 [Manish Amde] fixing MIMA excludes
e1c970d [Manish Amde] merged master
abf2901 [Manish Amde] adding classes to MimaExcludes.scala
45e767a [Manish Amde] adding developer api annotation for overriden methods
c8428c4 [Manish Amde] fixing weird multiline bug
afced16 [Manish Amde] removed label weights support
2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
4e85f2c [Manish Amde] minor: fixed scalastyle issues
b2ae41f [Manish Amde] minor: scalastyle
e4c1321 [Manish Amde] using while loop for regression histograms
d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
0fecd38 [Manish Amde] minor: add newline to EOF
2061cf5 [Manish Amde] merged from master
06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
9cc3e31 [Manish Amde] added implicit conversion import
5c1b2ca [Manish Amde] doc for PointConverter class
485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
3d7f911 [Manish Amde] updated doc
8e44ab8 [Manish Amde] updated doc
adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
e3e8843 [Manish Amde] minor code formatting
23d4268 [Manish Amde] minor: another minor code style
34ee7b9 [Manish Amde] minor: code style
237762d [Manish Amde] renaming functions
12e6d0a [Manish Amde] minor: removing line in doc
9a90c93 [Manish Amde] Merge branch 'master' into multiclass
1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
f5f6b83 [Manish Amde] multiclass for continous variables
8cfd3b6 [Manish Amde] working for categorical multiclass classification
828ff16 [Manish Amde] added categorical variable test
bce835f [Manish Amde] code cleanup
7e5f08c [Manish Amde] minor doc
1dd2735 [Manish Amde] bin search logic for multiclass
f16a9bb [Manish Amde] fixing while loop
d811425 [Manish Amde] multiclass bin aggregate logic
ab5cb21 [Manish Amde] multiclass logic
d8e4a11 [Manish Amde] sample weights
ed5a2df [Manish Amde] fixed classification requirements
d012be7 [Manish Amde] fixed while loop
18d2835 [Manish Amde] changing default values for num classes
6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
75f2bfc [Manish Amde] minor code style fix
e547151 [Manish Amde] minor modifications
34549d0 [Manish Amde] fixing error during merge
098e8c5 [Manish Amde] merged master
e006f9d [Manish Amde] changing variable names
5c78e1a [Manish Amde] added multiclass support
6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
46e06ee [Manish Amde] minor mods
3f85a17 [Manish Amde] tests for multiclass classification
4d5f70c [Manish Amde] added multiclass support for find splits bins
46f909c [Manish Amde] todo for multiclass support
455bea9 [Manish Amde] fixed tests
14aea48 [Manish Amde] changing instance format to weighted labeled point
a1a6e09 [Manish Amde] added weighted point class
968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-07-18 14:00:13 -07:00
Joseph K. Bradley 935fe65ff6 SPARK-1215 [MLLIB]: Clustering: Index out of bounds error (2)
Added a check to LocalKMeans.scala so that kMeansPlusPlus initialization handles the case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite. (Re-submitting PR after tangling commits in PR 1407 https://github.com/apache/spark/pull/1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
2014-07-17 15:05:02 -07:00
Alexander Ulanov 04b01bb101 [MLLIB] [SPARK-2222] Add multiclass evaluation metrics
Adding two classes:
1) MulticlassMetrics implements various multiclass evaluation metrics
2) MulticlassMetricsSuite implements unit tests for MulticlassMetrics
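
A hedged usage sketch of the new class: it consumes an RDD of (prediction, label) pairs.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.MulticlassMetrics

def evaluate(sc: SparkContext): Unit = {
  val predictionAndLabels = sc.parallelize(Seq(
    (0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (0.0, 0.0)))
  val metrics = new MulticlassMetrics(predictionAndLabels)
  println(metrics.confusionMatrix) // true classes in rows, predicted in columns
  println(metrics.precision(1.0))  // per-class precision for label 1.0
  println(metrics.recall(1.0))     // per-class recall for label 1.0
}
```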

Author: Alexander Ulanov <nashb@yandex.ru>
Author: unknown <ulanov@ULANOV1.emea.hpqcorp.net>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1155 from avulanov/master and squashes the following commits:

2eae80f [Alexander Ulanov] Merge pull request #1 from mengxr/avulanov-master
5ebeb08 [Xiangrui Meng] minor updates
79c3555 [Alexander Ulanov] Addressing reviewers comments mengxr
0fa9511 [Alexander Ulanov] Addressing reviewers comments mengxr
f0dadc9 [Alexander Ulanov] Addressing reviewers comments mengxr
4811378 [Alexander Ulanov] Removing println
87fb11f [Alexander Ulanov] Addressing reviewers comments mengxr. Added confusion matrix
e3db569 [Alexander Ulanov] Addressing reviewers comments mengxr. Added true positive rate and false positive rate. Test suite code style.
a7e8bf0 [Alexander Ulanov] Addressing reviewers comments mengxr
c3a77ad [Alexander Ulanov] Addressing reviewers comments mengxr
e2c91c3 [Alexander Ulanov] Fixes to mutliclass metics
d5ce981 [unknown] Comments about Double
a5c8ba4 [unknown] Unit tests. Class rename
fcee82d [unknown] Unit tests. Class rename
d535d62 [unknown] Multiclass evaluation
2014-07-15 08:40:22 -07:00
DB Tsai 52beb20f79 [SPARK-2477][MLlib] Using appendBias for adding intercept in GeneralizedLinearAlgorithm
Instead of using prependOne, as GeneralizedLinearAlgorithm currently does, we would like to use appendBias in order to 1) keep the indices of the original training set unchanged, by adding the intercept as the last element of the vector, and 2) use the same public API for consistently adding the intercept.
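
A simplified sketch of the idea, dense case only (the real appendBias also handles sparse vectors): the bias term becomes the last element, so the indices of the original features stay put, whereas prependOne shifted them all by one.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def appendBiasSketch(v: Vector): Vector =
  Vectors.dense(v.toArray :+ 1.0) // original indices 0..n-1 unchanged; bias at index n
```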

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1410 from dbtsai/SPARK-2477_intercept_with_appendBias and squashes the following commits:

011432c [DB Tsai] From Alpine Data Labs
2014-07-15 02:14:58 -07:00
Sean Owen 635888cbed SPARK-2363. Clean MLlib's sample data files
(Just made a PR for this, mengxr was the reporter of:)

MLlib has sample data under several folders:
1) data/mllib
2) data/
3) mllib/data/*
Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files.

Author: Sean Owen <sowen@cloudera.com>

Closes #1394 from srowen/SPARK-2363 and squashes the following commits:

54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/
2014-07-13 19:27:43 -07:00
Sandy Ryza 4c8be64e76 SPARK-2462. Make Vector.apply public.
Apologies if there's an already-discussed reason I missed for why this doesn't make sense.
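
For reference, a one-line illustration of what a public apply allows:

```scala
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(1.0, 2.0, 3.0)
assert(v(1) == 2.0) // direct element access via Vector.apply
```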

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1389 from sryza/sandy-spark-2462 and squashes the following commits:

2e5e201 [Sandy Ryza] SPARK-2462.  Make Vector.apply public.
2014-07-12 16:55:15 -07:00
Li Pu d38887b8a0 use specialized axpy in RowMatrix for SVD
After running some more tests on a large matrix, I found that the BV axpy (breeze/linalg/Vector.scala, axpy) is slower than the BSV axpy (breeze/linalg/operators/SparseVectorOps.scala, sv_dv_axpy): 8s vs. 2s for each multiplication. The BV axpy operates on an iterator while the BSV axpy directly operates on the underlying array. I think the overhead comes from creating the iterator (with a zip) and advancing the pointers.
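
For context, axpy computes y += a * x in place. A hedged Breeze sketch of the operation being compared above:

```scala
import breeze.linalg.{axpy, DenseVector, SparseVector}

val y = DenseVector.zeros[Double](5)
// SparseVector(indices, values, length): nonzeros at positions 0 and 3.
val x = new SparseVector(Array(0, 3), Array(1.0, 2.0), 5)
axpy(4.0, x, y) // specialized sparse-dense path; y is now (4.0, 0, 0, 8.0, 0)
```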

Author: Li Pu <lpu@twitter.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Li Pu <li.pu@outlook.com>

Closes #1378 from vrilleup/master and squashes the following commits:

6fb01a3 [Li Pu] use specialized axpy in RowMatrix
5255f2a [Li Pu] Merge remote-tracking branch 'upstream/master'
7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation when k = n
fe983b0 [Li Pu] improve scala style
96d2ecb [Li Pu] improve eigenvalue sorting
e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
2014-07-11 23:26:47 -07:00
DB Tsai 5596086935 [SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max
It basically moves the private ColumnStatisticsAggregator class from RowMatrix to a publicly available DeveloperApi, with documentation and unit tests.

Changes:
1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
2) When creating OnlineSummarizer object, the number of columns is not needed in the constructor. It's determined when users add the first sample.
3) Added the API documentation for MultivariateOnlineSummarizer.
4) Added unit tests for MultivariateOnlineSummarizer.
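
A hedged usage sketch of the resulting API:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val summarizer = new MultivariateOnlineSummarizer() // no column count needed
summarizer.add(Vectors.dense(1.0, 10.0)) // dimension fixed by the first sample
summarizer.add(Vectors.dense(3.0, 30.0))
println(summarizer.mean)     // [2.0, 20.0]
println(summarizer.variance) // per-column sample variance
println(summarizer.min)      // [1.0, 10.0]
println(summarizer.max)      // [3.0, 30.0]
```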

Author: DB Tsai <dbtsai@dbtsai.com>

Closes #955 from dbtsai/dbtsai-summarizer and squashes the following commits:

b13ac90 [DB Tsai] dbtsai-summarizer
2014-07-11 23:04:43 -07:00
Xiangrui Meng 2f59ce7dbe [SPARK-2358][MLLIB] Add an option to include native BLAS/LAPACK loader in the build
It would be easy for users to include the netlib-java JNI loader, which is LGPL-licensed, in the Spark jar. We can follow the same approach as Ganglia support in Spark, which can be enabled by turning on `-Pganglia-lgpl` at build time. We can use the `-Pnetlib-lgpl` flag for this.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1295 from mengxr/netlib-lgpl and squashes the following commits:

aebf001 [Xiangrui Meng] add a profile to optionally include native BLAS/LAPACK loader in mllib
2014-07-10 21:57:54 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
This patch introduces the new way of working while also retaining the existing ways of doing things.

For example, the build instruction for YARN in Maven is
`mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Li Pu 1f33e1f201 SPARK-1782: svd for sparse matrix using ARPACK
- copy ARPACK dsaupd/dseupd code from latest breeze
- change RowMatrix to use sparse SVD
- change tests for sparse SVD

All tests passed. I will run it against some large matrices.
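
A hedged usage sketch of the resulting RowMatrix API:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def topSingularValues(mat: RowMatrix): Unit = {
  // computeSVD chooses between a local decomposition and a distributed
  // ARPACK eigendecomposition based on the matrix size and k.
  val svd = mat.computeSVD(k = 5, computeU = true)
  println(svd.s)         // the top 5 singular values
  println(svd.V.numRows) // V is a local n x k matrix
}
```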

Author: Li Pu <lpu@twitter.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Li Pu <li.pu@outlook.com>

Closes #964 from vrilleup/master and squashes the following commits:

7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation when k = n
fe983b0 [Li Pu] improve scala style
96d2ecb [Li Pu] improve eigenvalue sorting
e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
2014-07-09 12:15:08 -07:00
johnnywalleye d35e3db232 [SPARK-2417][MLlib] Fix DecisionTree tests
Fixes test failures introduced by https://github.com/apache/spark/pull/1316.

For both the regression and classification cases, `stats` is the InformationGainStats for the best tree split, and `stats.predict` is the predicted value for the data before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <jsondag@gmail.com>

Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
2014-07-09 11:06:34 -07:00
johnnywalleye 1114207cc8 [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160)
Hi, this pull request fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

```scala
rightNodeAgg(featureIndex)(2 * (numBins - 2)) =
  binData(shift + (2 * numBins - 1))
```

Then there is a loop that sets the rest of the values, as in line 809:

```scala
rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 * (numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))
```

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <jsondag@gmail.com>

Closes #1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
2014-07-08 19:17:26 -07:00
Sean Owen 2b36344f58 SPARK-1675. Make clear whether computePrincipalComponents requires centered data
Just closing out this small JIRA, resolving with a comment change.

Author: Sean Owen <sowen@cloudera.com>

Closes #1171 from srowen/SPARK-1675 and squashes the following commits:

45ee9b7 [Sean Owen] Add simple note that data need not be centered for computePrincipalComponents
2014-07-03 11:54:51 -07:00
Szul, Piotr 441cdcca64 [SPARK-2172] PySpark cannot import mllib modules in YARN-client mode
Include the pyspark/mllib Python sources as resources in the mllib jar.
This way they will be included in the final assembly.

Author: Szul, Piotr <Piotr.Szul@csiro.au>

Closes #1223 from piotrszul/branch-1.0 and squashes the following commits:

69d5174 [Szul, Piotr] Removed unsed resource directory src/main/resource from mllib pom
f8c52a0 [Szul, Piotr] [SPARK-2172] PySpark cannot import mllib modules in YARN-client mode Include pyspark/mllib python sources as resources in the jar

(cherry picked from commit fa167194ce)
Signed-off-by: Reynold Xin <rxin@apache.org>
2014-06-25 23:07:16 -07:00
Gang Bai d484ddeff1 [SPARK-2163] class LBFGS optimize with Double tolerance instead of Int
https://issues.apache.org/jira/browse/SPARK-2163

This pull request includes the change for **[SPARK-2163]**:

* Changed the convergence tolerance parameter from type `Int` to type `Double`.
* Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`.
* Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`.

This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation.
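
A hedged sketch of the setter after the change:

```scala
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val optimizer = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
  .setConvergenceTol(1e-4) // a Double now, so sub-integer tolerances work
  .setRegParam(0.01)
```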

Author: Gang Bai <me@baigang.net>

Closes #1104 from BaiGang/fix_int_tol and squashes the following commits:

cecf02c [Gang Bai] Changed `setConvergenceTol` to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.
2014-06-20 08:52:20 -07:00