Commit graph

341 commits

Author SHA1 Message Date
Xiangrui Meng 74d6f62264 [SPARK-1997][MLLIB] update breeze to 0.9
0.9 dependences (this version doesn't depend on scalalogging and I excluded commons-math3 from its transitive dependencies):
~~~
+-org.scalanlp:breeze_2.10:0.9 [S]
  +-com.github.fommil.netlib:core:1.1.2
  +-com.github.rwl:jtransforms:2.4.0
  +-net.sf.opencsv:opencsv:2.3
  +-net.sourceforge.f2j:arpack_combined_all:0.1
  +-org.scalanlp:breeze-macros_2.10:0.3.1 [S]
  | +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
  |
  +-org.slf4j:slf4j-api:1.7.5
  +-org.spire-math:spire_2.10:0.7.4 [S]
    +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
    |
    +-org.spire-math:spire-macros_2.10:0.7.4 [S]
      +-org.scalamacros:quasiquotes_2.10:2.0.0 [S]
~~~

Closes #1749

CC: witgo avati

Author: Xiangrui Meng <meng@databricks.com>

Closes #1857 from mengxr/breeze-0.9 and squashes the following commits:

7fc16b6 [Xiangrui Meng] don't know why but exclude a private method for mima
dcc502e [Xiangrui Meng] update breeze to 0.9
2014-08-08 15:07:31 -07:00
Xiangrui Meng b9e9e53773 [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms
This is part of SPARK-2828:

1. separate IDF model from IDF algorithm (which generates a model)
2. separate StandardScaler model from StandardScaler

CC: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1814 from mengxr/feature-api-update and squashes the following commits:

40d863b [Xiangrui Meng] move mean and variance to model
48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm
89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm
2014-08-07 11:28:12 -07:00
Joseph K. Bradley 8d1dec4fa4 [mllib] DecisionTree Strategy parameter checks
Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters.
CC mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1821 from jkbradley/dt-robustness and squashes the following commits:

4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness
7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters
2014-08-07 00:20:38 -07:00
Joseph K. Bradley 47ccd5e71b [SPARK-2851] [mllib] DecisionTree Python consistency update
Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).

Added factory classes for Algo and Impurity, but made private[mllib].

CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:

6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
fe6dbfa [Joseph K. Bradley] removed unnecessary imports
e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
c699850 [Joseph K. Bradley] a few doc comments
eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
2014-08-06 22:58:59 -07:00
Xiangrui Meng 25cff1019d [SPARK-2852][MLLIB] API consistency for mllib.feature
This is part of SPARK-2828:

1. added a Java-friendly fit method to Word2Vec with tests
2. change DeveloperApi to Experimental for Normalizer & StandardScaler
3. change default feature dimension to 2^20 in HashingTF

Author: Xiangrui Meng <meng@databricks.com>

Closes #1807 from mengxr/feature-api-check and squashes the following commits:

773c1a9 [Xiangrui Meng] change default numFeatures to 2^20 in HashingTF change annotation from DeveloperApi to Experimental in Normalizer and StandardScaler
883e122 [Xiangrui Meng] add @Experimental to Word2VecModel add a Java-friendly method to Word2Vec.fit with tests
2014-08-06 14:07:51 -07:00
DB Tsai c7b52010df [MLlib] Use this.type as return type in k-means' builder pattern
to ensure that the return object is itself.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1796 from dbtsai/dbtsai-kmeans and squashes the following commits:

658989e [DB Tsai] Alpine Data Labs
2014-08-05 23:32:29 -07:00
Michael Giannakopoulos 1aad9114c9 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods
Related to Jira Issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC)

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1775 from miccagiann/linearMethodsReg and squashes the following commits:

cb774c3 [Michael Giannakopoulos] MiniBatchFraction added in related PythonMLLibAPI java stubs.
81fcbc6 [Michael Giannakopoulos] Fixing a typo-error.
8ad263e [Michael Giannakopoulos] Adding regularizer type and intercept parameters to LogisticRegressionWithSGD and SVMWithSGD.
2014-08-05 16:30:32 -07:00
Xiangrui Meng cc491f69cd [SPARK-2864][MLLIB] fix random seed in word2vec; move model to local
It also moves the model to local in order to map `RDD[String]` to `RDD[Vector]`.

Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #1790 from mengxr/word2vec-fix and squashes the following commits:

a87146c [Xiangrui Meng] add setters and make a default constructor
e5c923b [Xiangrui Meng] fix random seed in word2vec; move model to local
2014-08-05 16:22:41 -07:00
Liquan Pei e053c55819 [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

To investigate the vector representations is to find the closest words for a query word. For example, the top 20 closest words to "china" are for 1 partition and 1 iteration :

taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876

The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932

Author: Liquan Pei <lpei@gopivotal.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Liquan Pei <liquanpei@gmail.com>

Closes #1719 from Ishiihara/master and squashes the following commits:

2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit
2014-08-03 23:55:58 -07:00
DB Tsai ae58aea2d1 SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.

There are two implementations of `VectorTransformer` now, and they all can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

2) `Normalizer` - Normalizes samples individually to unit L^n norm

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs
2014-08-03 21:39:21 -07:00
Joseph K. Bradley 2998e38a94 [SPARK-2197] [mllib] Java DecisionTree bug fix and easy-of-use
Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.

Other improvements to Decision Trees for easy-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
--> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1740 from jkbradley/dt-java-new and squashes the following commits:

0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-03 10:36:52 -07:00
Joseph K. Bradley 3f67382e7c [SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.

API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()

Examples and testing:
* Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors

Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:

3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
2014-08-02 13:07:17 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
2014-08-01 23:55:11 -07:00
Burak fda475987f [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator. RandomRDD is now of generic type
The RandomRDDGenerators used to only output RDD[Double].
Now RandomRDDGenerators.randomRDD can be used to generate a random RDD[T] via a class that extends RandomDataGenerator, by supplying a type T and overriding the nextValue() function as they wish.

Author: Burak <brkyvz@gmail.com>

Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:

c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
2014-08-01 22:32:12 -07:00
Tor Myklebust e25ec06171 [SPARK-1580][MLLIB] Estimate ALS communication and computation costs.
Continue the work from #493.

Closes #493 and Closes #593

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:

9b56a8b [Xiangrui Meng] updated API and added a simple test
68a3229 [Xiangrui Meng] merge master
217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
8cbb718 [Tor Myklebust] Braces get spaces.
0455cd4 [Tor Myklebust] Parens for collectAsMap.
2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
8b21e6d [Tor Myklebust] Fix overlong lines.
8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
6c31324 [Tor Myklebust] Make it actually build...
a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-08-01 21:25:02 -07:00
Michael Giannakopoulos c281189222 [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods.
Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <miccagiann@gmail.com>

Closes #1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
2014-08-01 21:00:31 -07:00
Jeremy Freeman f6a1899306 Streaming mllib [SPARK-2438][MLLIB]
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_
- An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_
- Unit tests

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming
2014-08-01 20:10:26 -07:00
Joseph K. Bradley 7058a5393b [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features
Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.

Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"

Bug fix: Modified upper bound discussed above.

Also: Small improvements to coding style in DecisionTree.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:

225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
2014-08-01 15:52:21 -07:00
Doris Xin d88e695613 [SPARK-2786][mllib] Python correlations
Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1713 from dorx/pythonCorrelation and squashes the following commits:

5f1e60c [Doris Xin] reviewer comments.
46ff6eb [Doris Xin] reviewer comments.
ad44085 [Doris Xin] style fix
e69d446 [Doris Xin] fixed missed conflicts.
eb5bf56 [Doris Xin] merge master
cc9f725 [Doris Xin] units passed.
9141a63 [Doris Xin] WIP2
d199f1f [Doris Xin] Moved correlation names into a public object
cd163d6 [Doris Xin] WIP
2014-08-01 15:02:17 -07:00
Xiangrui Meng 5328c0aaa0 [HOTFIX] downgrade breeze version to 0.7
breeze-0.8.1 causes dependency issues, as discussed in #940 .

Author: Xiangrui Meng <meng@databricks.com>

Closes #1718 from mengxr/revert-breeze and squashes the following commits:

99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
2014-08-01 10:00:46 -07:00
witgo 0dacb1adb5 [SPARK-1997] update breeze to version 0.8.1
`breeze 0.8.1`  dependent on  `scala-logging-slf4j 2.1.1` The relevant code on #1369

Author: witgo <witgo@qq.com>

Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:

65cc65e [witgo] update breeze  to version 0.8.1
2014-08-01 07:47:44 -07:00
Sean Owen 82d209d43f SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel
Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa – users for a product).

(This also corrects some long lines in the Java ALS test suite.)

As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? with a pointer about where that should go, I could add that.

Author: Sean Owen <srowen@gmail.com>

Closes #1687 from srowen/SPARK-2768 and squashes the following commits:

b349675 [Sean Owen] Additional review changes
c9edb04 [Sean Owen] Updates from code review
7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
2014-08-01 07:32:53 -07:00
Doris Xin c4755403e7 [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation
getRanks computes the wrong rank when numPartition >= size in the input RDDs before this patch. added units to address this bug.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1710 from dorx/correlationBug and squashes the following commits:

733def4 [Doris Xin] bugs and reviewer comments.
31db920 [Doris Xin] revert unnecessary change
043ff83 [Doris Xin] bug fix for spearman corner case
2014-07-31 21:23:35 -07:00
Xiangrui Meng b19008320b [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK
Now the factors are persisted in memory only. If they get kicked off by later jobs, we might have to start the computation from very beginning. A better solution is changing the storage level to `MEMORY_AND_DISK`.

srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #1700 from mengxr/als-level and squashes the following commits:

c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
2014-07-31 21:14:08 -07:00
Joseph K. Bradley b124de584a [SPARK-2756] [mllib] Decision tree bug fixes
(1) Inconsistent aggregate (agg) indexing for unordered features.
(2) Fixed gain calculations for edge cases.
(3) One-off error in choosing thresholds for continuous features for small datasets.
(4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)

Other updates, to help with tests:
* Updated DecisionTreeRunner to print more info.
* Added utility functions to DecisionTreeModel: toString, depth, numNodes
* Improved internal DecisionTree documentation

Bug fix details:

(1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

(2) Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.

Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.

(3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.

* Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
* Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
* Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using exam
* Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.

CC: mengxr manishamde  Please let me know if I missed something!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:

2b20c61 [Joseph K. Bradley] Small doc and style updates
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
2014-07-31 20:51:48 -07:00
Doris Xin d8430148ee [SPARK-2724] Python version of RandomRDDGenerators
RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP
2014-07-31 20:32:57 -07:00
Xiangrui Meng dc0865bc7e [SPARK-2511][MLLIB] add HashingTF and IDF
This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .

Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1671 from mengxr/tfidf and squashes the following commits:

7d65888 [Xiangrui Meng] use JavaConverters._
5fe9ec4 [Xiangrui Meng] fix unit test
6e214ec [Xiangrui Meng] add apache header
cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
3814440 [Xiangrui Meng] add HashingTF and IDF
2014-07-31 12:55:00 -07:00
Sean Owen e9b275b769 SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets
Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.

Author: Sean Owen <srowen@gmail.com>

Closes #1663 from srowen/SPARK-2341 and squashes the following commits:

8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
18a8c8e [Sean Owen] Updates from review
83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
2014-07-30 17:34:32 -07:00
Sean Owen 6ab96a6fd0 SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
The Maven-based builds in the build matrix have been failing for a few days:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/

On inspection, it looks like the Spark SQL Java tests don't compile:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull

I confirmed it by repeating the command vs master:

`mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`

The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`

Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).

Author: Sean Owen <srowen@gmail.com>

Closes #1660 from srowen/SPARK-2749 and squashes the following commits:

858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
2014-07-30 15:04:33 -07:00
GuoQiang Li fc47bb6967 [SPARK-2544][MLLIB] Improve ALS algorithm resource usage
Author: GuoQiang Li <witgo@qq.com>
Author: witgo <witgo@qq.com>

Closes #929 from witgo/improve_als and squashes the following commits:

ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
154dccf [GuoQiang Li] checkpoint products only
c5779ff [witgo] Improve ALS algorithm resource usage
2014-07-30 11:00:11 -07:00
Sean Owen ee07541e99 SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log
In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists.

Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.

While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible.

Also note the related PR for Python: https://github.com/apache/spark/pull/1652

Author: Sean Owen <srowen@gmail.com>

Closes #1659 from srowen/SPARK-2748 and squashes the following commits:

c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
2014-07-30 08:55:15 -07:00
Xiangrui Meng 20424dad30 [SPARK-2174][MLLIB] treeReduce and treeAggregate
In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.

I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on 32-node m3.2xlarge cluster.

code:

~~~
import breeze.linalg._
import org.apache.log4j._

Logger.getRootLogger.setLevel(Level.OFF)

for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
  val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
  var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
  start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
}
~~~

out:

| n | treeReduce(,2) | reduce |
|---|---------------------|-----------|
| 10 | 0.215538731 | 0.204206899 |
| 100 | 0.278405907 | 0.205732582 |
| 1000 | 0.208972182 | 0.214298272 |
| 10000 | 0.194792071 | 0.349353687 |
| 100000 | 0.347683285 | 6.086671892 |
| 1000000 | 2.589350682 | 66.572906702 |

CC: @pwendell

This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1110 from mengxr/tree and squashes the following commits:

c6cd267 [Xiangrui Meng] make depth default to 2
b04b96a [Xiangrui Meng] address comments
9bcc5d3 [Xiangrui Meng] add depth for readability
7495681 [Xiangrui Meng] fix compile error
142a857 [Xiangrui Meng] merge master
d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
be6a88a [Xiangrui Meng] use treeAggregate in mllib
0f94490 [Xiangrui Meng] add docs
eb71c33 [Xiangrui Meng] add treeReduce
fe42a5e [Xiangrui Meng] add treeAggregate
2014-07-29 01:16:41 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
DB Tsai 255b56f9f5 [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests
Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.

Based on discussion in the community, we have implemented two different APIs for relative tolerance, and absolute tolerance. It makes sense that test writers should know which one they need depending on their circumstances.

Developers also need to explicitly specify the eps, and there is no default value which will sometimes cause confusion.

When comparing against zero using relative tolerance, a exception will be raised to warn users that it's meaningless.

For relative tolerance, users can now write

    assert(23.1 ~== 23.52 relTol 0.02)
    assert(23.1 ~== 22.74 relTol 0.02)
    assert(23.1 ~= 23.52 relTol 0.02)
    assert(23.1 ~= 22.74 relTol 0.02)
    assert(!(23.1 !~= 23.52 relTol 0.02))
    assert(!(23.1 !~= 22.74 relTol 0.02))

    // This will throw exception with the following message.
    // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
    assert(23.1 !~== 23.52 relTol 0.02)

    // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
    assert(23.1 ~== 22.34 relTol 0.02)

For absolute error,

    assert(17.8 ~== 17.99 absTol 0.2)
    assert(17.8 ~== 17.61 absTol 0.2)
    assert(17.8 ~= 17.99 absTol 0.2)
    assert(17.8 ~= 17.61 absTol 0.2)
    assert(!(17.8 !~= 17.99 absTol 0.2))
    assert(!(17.8 !~= 17.61 absTol 0.2))

    // This will throw exception with the following message.
    // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
    assert(17.8 !~== 17.99 absTol 0.2)

    // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
    assert(17.8 ~== 17.59 absTol 0.2)

Authors:
  DB Tsai <dbtsaialpinenow.com>
  Marek Kolodziej <marekalpinenow.com>

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:

8c7cbcc [DB Tsai] Alpine Data Labs
2014-07-28 11:34:19 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Doris Xin 81fcdd22c8 [SPARK-2514] [mllib] Random RDD generator
Utilities for generating random RDDs.

RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range` objects in Scala can only have `size <= Int.MaxValue`.

The object `RandomRDDGenerators` can be transformed into a generator class to reduce the number of auxiliary methods for optional arguments.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1520 from dorx/randomRDD and squashes the following commits:

01121ac [Doris Xin] reviewer comments
6bf27d8 [Doris Xin] Merge branch 'master' into randomRDD
a8ea92d [Doris Xin] Reviewer comments
063ea0b [Doris Xin] Merge branch 'master' into randomRDD
aec68eb [Doris Xin] newline
bc90234 [Doris Xin] units passed.
d56cacb [Doris Xin] impl with RandomRDD
92d6f1c [Doris Xin] solution for Cloneable
df5bcff [Doris Xin] Merge branch 'generator' into randomRDD
f46d928 [Doris Xin] WIP
49ed20d [Doris Xin] alternative poisson distribution generator
7cb0e40 [Doris Xin] fix for data inconsistency
8881444 [Doris Xin] RandomRDDGenerator: initial design
2014-07-27 16:16:39 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Doris Xin 3a69c72e5c [SPARK-2679] [MLLib] Ser/De for Double
Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1581 from dorx/doubleSerDe and squashes the following commits:

86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
2bfe7a4 [Doris Xin] Removed magic byte
ad4d0d9 [Doris Xin] removed a space in unit
a9020bc [Doris Xin] units passed
7dad9af [Doris Xin] WIP
2014-07-27 07:21:07 -07:00
Xiangrui Meng aaf2b735fd [SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure
We saw task serialization problems with large feature dimension, which could be avoid if we don't serialize data directly into task but use broadcast variables. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
2014-07-26 22:56:07 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Matei Zaharia 8529ced35c SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup
JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.

Author: Matei Zaharia <matei@databricks.com>

Closes #1555 from mateiz/compact-groupby and squashes the following commits:

845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
07621a7 [Matei Zaharia] Review comments
0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
f61f040 [Matei Zaharia] Fix line lengths
59da88b0 [Matei Zaharia] Fix line lengths
197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
2014-07-25 00:32:32 -07:00
Xiangrui Meng c960b50518 [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests
Allow small errors in comparison.

@dbtsai , this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests
2014-07-24 12:37:02 -07:00
Xiangrui Meng 4c7243e109 [SPARK-2617] Correct doc and usages of preservesPartitioning
The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `maPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
2014-07-23 00:58:55 -07:00
peng.zhang 75db1742ab [SPARK-2612] [mllib] Fix data skew in ALS
Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1521 from renozhang/fix-als and squashes the following commits:

b5727a4 [peng.zhang] Remove no need argument
1a4f7a0 [peng.zhang] Fix data skew in ALS
2014-07-22 02:39:07 -07:00
Xiangrui Meng 1b10b8114a [SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors
This is part of SPARK-2495 to allow users construct linear models manually.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1492 from mengxr/public-constructor and squashes the following commits:

a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
2014-07-20 13:04:59 -07:00
Doris Xin a243364b22 [SPARK-2359][MLlib] Correlations
Implementation for Pearson and Spearman's correlation.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1367 from dorx/correlation and squashes the following commits:

c0dd7dc [Doris Xin] here we go
32d83a3 [Doris Xin] Reviewer comments
4db0da1 [Doris Xin] added private[stat] to Spearman
b716f70 [Doris Xin] minor fixes
6e1b42a [Doris Xin] More comments addressed. Still some open questions
8104f44 [Doris Xin] addressed comments. some open questions still
39387c2 [Doris Xin] added missing header
bd3cf19 [Doris Xin] Merge branch 'master' into correlation
6341884 [Doris Xin] race condition bug squished
bd2bacf [Doris Xin] Race condition bug
b775ff9 [Doris Xin] old wrong impl
534ebf2 [Doris Xin] Merge branch 'master' into correlation
818fa31 [Doris Xin] wip units
9d808ee [Doris Xin] wip units
b843a13 [Doris Xin] revert change in stat counter
28561b6 [Doris Xin] wip
bb2e977 [Doris Xin] minor fix
8e02c63 [Doris Xin] Merge branch 'master' into correlation
2a40aa1 [Doris Xin] initial, untested implementation of Pearson
dfc4854 [Doris Xin] WIP
2014-07-18 17:25:32 -07:00
Manish Amde d88f6be446 [MLlib] SPARK-1536: multiclass classification support for decision tree
The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for mllib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost which requires instances to be weighted.

It handles the special case where the categorical variables cannot be ordered for multiclass classification and thus the optimizations used for speeding up binary classification cannot be directly used for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the ```2^(m-1) - 1``` categorical splits provided that #splits are less than the maxBins provided in the input. This condition will not be met for features with large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing bin size of the tree algorithms, use binary encoding for categorical features or use one-vs-all classification strategy) to avoid these constraints.

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.

cc: mengxr, etrain, hirakendu, atalwalkar, srowen

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #886 from manishamde/multiclass and squashes the following commits:

26f8acc [Manish Amde] another attempt at fixing mima
c5b2d04 [Manish Amde] more MIMA fixes
1ce7212 [Manish Amde] change problem filter for mima
10fdd82 [Manish Amde] fixing MIMA excludes
e1c970d [Manish Amde] merged master
abf2901 [Manish Amde] adding classes to MimaExcludes.scala
45e767a [Manish Amde] adding developer api annotation for overriden methods
c8428c4 [Manish Amde] fixing weird multiline bug
afced16 [Manish Amde] removed label weights support
2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
4e85f2c [Manish Amde] minor: fixed scalastyle issues
b2ae41f [Manish Amde] minor: scalastyle
e4c1321 [Manish Amde] using while loop for regression histograms
d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
0fecd38 [Manish Amde] minor: add newline to EOF
2061cf5 [Manish Amde] merged from master
06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
9cc3e31 [Manish Amde] added implicit conversion import
5c1b2ca [Manish Amde] doc for PointConverter class
485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
3d7f911 [Manish Amde] updated doc
8e44ab8 [Manish Amde] updated doc
adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
e3e8843 [Manish Amde] minor code formatting
23d4268 [Manish Amde] minor: another minor code style
34ee7b9 [Manish Amde] minor: code style
237762d [Manish Amde] renaming functions
12e6d0a [Manish Amde] minor: removing line in doc
9a90c93 [Manish Amde] Merge branch 'master' into multiclass
1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
f5f6b83 [Manish Amde] multiclass for continous variables
8cfd3b6 [Manish Amde] working for categorical multiclass classification
828ff16 [Manish Amde] added categorical variable test
bce835f [Manish Amde] code cleanup
7e5f08c [Manish Amde] minor doc
1dd2735 [Manish Amde] bin search logic for multiclass
f16a9bb [Manish Amde] fixing while loop
d811425 [Manish Amde] multiclass bin aggregate logic
ab5cb21 [Manish Amde] multiclass logic
d8e4a11 [Manish Amde] sample weights
ed5a2df [Manish Amde] fixed classification requirements
d012be7 [Manish Amde] fixed while loop
18d2835 [Manish Amde] changing default values for num classes
6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
75f2bfc [Manish Amde] minor code style fix
e547151 [Manish Amde] minor modifications
34549d0 [Manish Amde] fixing error during merge
098e8c5 [Manish Amde] merged master
e006f9d [Manish Amde] changing variable names
5c78e1a [Manish Amde] added multiclass support
6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
46e06ee [Manish Amde] minor mods
3f85a17 [Manish Amde] tests for multiclass classification
4d5f70c [Manish Amde] added multiclass support for find splits bins
46f909c [Manish Amde] todo for multiclass support
455bea9 [Manish Amde] fixed tests
14aea48 [Manish Amde] changing instance format to weighted labeled point
a1a6e09 [Manish Amde] added weighted point class
968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-07-18 14:00:13 -07:00
Joseph K. Bradley 935fe65ff6 SPARK-1215 [MLLIB]: Clustering: Index out of bounds error (2)
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.  (Re-submitting PR after tangling commits in PR 1407 https://github.com/apache/spark/pull/1407 )

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1468 from jkbradley/kmeans-fix and squashes the following commits:

4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr
6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k.  Added two related unit tests to KMeansSuite.
2014-07-17 15:05:02 -07:00