ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Sean Owen	856c50f59b	SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions Another handful of small build changes to organize and standardize a bit, and avoid warnings: - Update Maven plugin versions for good measure - Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway) - Use variables to define versions across dependencies where they should move in lock step - ... and make this consistent between Maven/SBT OK, I also updated the JIRA URL while I was at it here. Author: Sean Owen <sowen@cloudera.com> Closes #291 from srowen/SPARK-1387 and squashes the following commits: 461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.	2014-04-06 17:41:01 -07:00
Xiangrui Meng	9c65fa76f9	[SPARK-1212, Part II] Support sparse data in MLlib In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes: 1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`. 2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure. 3. Mark 'createModel' and 'predictPoint' protected because they are not for end users. 4. Add libSVMFile to MLContext. 5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`). 6. Gradient computation no longer creates temp vectors. 7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training. TODO: 1. ~~Use axpy when possible.~~ 2. 
~~Optimize Naive Bayes.~~ Author: Xiangrui Meng <meng@databricks.com> Closes #245 from mengxr/vector and squashes the following commits: eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector] 11999c7 [Xiangrui Meng] Merge branch 'master' into vector f7da54b [Xiangrui Meng] add minSplits to libSVMFile da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning 493f26f [Xiangrui Meng] Merge branch 'master' into vector 7c1bc01 [Xiangrui Meng] add a TODO to NB b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM 4addc50 [Xiangrui Meng] merge master 4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests d088552 [Xiangrui Meng] use static constructor for MLContext 6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically 3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data 0f8759b [Xiangrui Meng] minor updates to NB b11659c [Xiangrui Meng] style update 78c4671 [Xiangrui Meng] add libSVMFile to MLContext f0fe616 [Xiangrui Meng] add a test for sparse linear regression 44733e1 [Xiangrui Meng] use in-place gradient computation e981396 [Xiangrui Meng] use axpy in Updater db808a1 [Xiangrui Meng] update JavaLR example befa592 [Xiangrui Meng] passed scala/java tests 75c83a4 [Xiangrui Meng] passed test compile 1859701 [Xiangrui Meng] passed compile 834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.) 
135ab72 [Xiangrui Meng] merge glm 0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used 3f346ba [Xiangrui Meng] update some ml algorithms to use Vector	2014-04-02 14:01:12 -07:00
Manish Amde	8b3045ceab	MLI-1 Decision Trees Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010. Key features: + Supports binary classification and regression + Supports gini, entropy and variance for information gain calculation + Supports both continuous and categorical features The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include: 1. Level-wise training to reduce passes over the entire dataset. 2. Bin-wise split calculation to reduce computation overhead. 3. Aggregation over partitions before combining to reduce communication overhead. Author: Manish Amde <manish9ue@gmail.com> Author: manishamde <manish9ue@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #79 from manishamde/tree and squashes the following commits: 1e8c704 [Manish Amde] remove numBins field in the Strategy class 7d54b4f [manishamde] Merge pull request #4 from mengxr/dtree f536ae9 [Xiangrui Meng] another pass on code style e1dd86f [Manish Amde] implementing code style suggestions 62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing 201702f [Manish Amde] making some more methods private f963ef5 [Manish Amde] making methods private c487e6a [manishamde] Merge pull request #1 from mengxr/dtree 24500c5 [Xiangrui Meng] minor style updates 4576b64 [Manish Amde] documentation and for to while loop conversion ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins 632818f [Manish Amde] removing threshold for classification predict method 2116360 [Manish Amde] removing dummy bin calculation for categorical variables 6068356 [Manish Amde] ensuring num bins is always greater than max number of categories 62c2562 [Manish Amde] fixing comment indentation ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions d1ef4f6 [Manish Amde] more documentation 794ff4d [Manish Amde] minor 
improvements to docs and style eb8fcbe [Manish Amde] minor code style updates cd2c2b4 [Manish Amde] fixing code style based on feedback 63e786b [Manish Amde] added multiple train methods for java compatability d3023b3 [Manish Amde] adding more docs for nested methods 84f85d6 [Manish Amde] code documentation 9372779 [Manish Amde] code style: max line lenght <= 100 dd0c0d7 [Manish Amde] minor: some docs 0dd7659 [manishamde] basic doc 5841c28 [Manish Amde] unit tests for categorical features f067d68 [Manish Amde] minor cleanup c0e522b [Manish Amde] updated predict and split threshold logic b09dc98 [Manish Amde] minor refactoring 6b7de78 [Manish Amde] minor refactoring and tests d504eb1 [Manish Amde] more tests for categorical features dbb7ac1 [Manish Amde] categorical feature support 6df35b9 [Manish Amde] regression predict logic 53108ed [Manish Amde] fixing index for highest bin e23c2e5 [Manish Amde] added regression support c8f6d60 [Manish Amde] adding enum for feature type b0e3e76 [Manish Amde] adding enum for feature type 154aa77 [Manish Amde] enums for configurations 733d6dd [Manish Amde] fixed tests 02c595c [Manish Amde] added command line parsing 98ec8d5 [Manish Amde] tree building and prediction logic b0eb866 [Manish Amde] added logic to handle leaf nodes 80e8c66 [Manish Amde] working version of multi-level split calculation 4798aae [Manish Amde] added gain stats class dad0afc [Manish Amde] decison stump functionality working 03f534c [Manish Amde] some more tests 0012a77 [Manish Amde] basic stump working 8bca1e2 [Manish Amde] additional code for creating intermediate RDD 92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested. cd53eae [Manish Amde] skeletal framework	2014-04-01 21:40:49 -07:00
Xiangrui Meng	d679843a39	[SPARK-1327] GLM needs to check addIntercept for intercept and weights GLM needs to check addIntercept for intercept and weights. The current implementation always uses the first weight as intercept. Added a test for training without adding intercept. JIRA: https://spark-project.atlassian.net/browse/SPARK-1327 Author: Xiangrui Meng <meng@databricks.com> Closes #236 from mengxr/glm and squashes the following commits: bcac1ac [Xiangrui Meng] add two tests to ensure {Lasso, Ridge}.setIntercept will throw an exceptions a104072 [Xiangrui Meng] remove protected to be compatible with 0.9 0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used	2014-03-26 19:30:20 -07:00
Xiangrui Meng	80c29689ae	[SPARK-1212] Adding sparse data support and update KMeans Continue our discussions from https://github.com/apache/incubator-spark/pull/575 This PR is WIP because it depends on a SNAPSHOT version of breeze. Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze. @fommil, please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling. I'm going to update this PR to include: 1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time. 2. Some numbers about the performance. 3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks! 
Author: Xiangrui Meng <meng@databricks.com> Closes #117 from mengxr/sparse-kmeans and squashes the following commits: 67b368d [Xiangrui Meng] fix SparseVector.toArray 5eda0de [Xiangrui Meng] update NOTICE 67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd 1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly 9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray 226d2cd [Xiangrui Meng] update Java friendly methods in Vectors 238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]] b28ba2f [Xiangrui Meng] add toArray to Vector e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java 72bde33 [Xiangrui Meng] clean up code for distance computation 712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly 27858e4 [Xiangrui Meng] update breeze version to 0.7 07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation 6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs 42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel 42b4e50 [Xiangrui Meng] line feed at the end a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans 3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm 0107e19 [Xiangrui Meng] update NOTICE 87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays 0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans 4e7d5ca [Xiangrui Meng] minor style update 07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data	2014-03-23 17:34:02 -07:00
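Note on 80c29689ae: the fast distance computation described in that message can be sketched roughly as below. This is a minimal illustration, not the MLlib or breeze implementation. The names `Sparse`, `squaredNorm`, `dot` and `fastSquaredDistance` are hypothetical, the sparse layout (parallel index and value arrays) is an assumption, and the clamp to zero is an added safeguard. The numerical-error check that the message mentions is omitted.

```scala
// Hypothetical sketch; names and layout are assumptions, not the MLlib API.
object FastDistanceSketch {
  /** A sparse vector stored as parallel arrays of indices and non-zero values. */
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  /** Squared L2 norm; intended to be computed once per vector and reused. */
  def squaredNorm(v: Sparse): Double = v.values.map(x => x * x).sum

  /** Dot product of a dense center with a sparse point; touches only the non-zeros. */
  def dot(dense: Array[Double], sparse: Sparse): Double = {
    var sum = 0.0
    var i = 0
    while (i < sparse.indices.length) {
      sum += dense(sparse.indices(i)) * sparse.values(i)
      i += 1
    }
    sum
  }

  /** ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b, using precomputed squared norms. */
  def fastSquaredDistance(center: Array[Double], centerNorm2: Double,
                          point: Sparse, pointNorm2: Double): Double =
    // Clamp at zero (an assumption of this sketch) because rounding can make
    // the expression slightly negative for nearly identical vectors.
    math.max(centerNorm2 + pointNorm2 - 2.0 * dot(center, point), 0.0)
}
```

With both squared norms cached, the per-pair cost is driven by the number of non-zeros in the sparse point rather than by the full dimension n.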
Reza Zadeh	66a03e5fe0	Principal Component Analysis # Principal Component Analysis Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the coefficients return matrix contains coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm. ## Testing Tests included: * All principal components * Only top k principal components * Dense SVD tests * Dense/sparse matrix tests The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html ## Documentation Added to mllib-guide.md ## Example Usage Added to examples directory under SparkPCA.scala Author: Reza Zadeh <rizlar@gmail.com> Closes #88 from rezazadeh/sparkpca and squashes the following commits: e298700 [Reza Zadeh] reformat using IDE 3f23271 [Reza Zadeh] documentation and cleanup b025ab2 [Reza Zadeh] documentation e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals 3787bb4 [Reza Zadeh] stylin c6ecc1f [Reza Zadeh] docs aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense 56975b0 [Reza Zadeh] docs 2df9bde [Reza Zadeh] docs update 8fb0015 [Reza Zadeh] rcond documentation dbf7797 [Reza Zadeh] correct argument number a9f1f62 [Reza Zadeh] documentation 4ce6caa [Reza Zadeh] style changes 9a56a02 [Reza Zadeh] use rcond relative to larget svalue 120f796 [Reza Zadeh] housekeeping 156ff78 [Reza Zadeh] string comprehension 2e1cf43 [Reza Zadeh] rename rcond ea223a6 [Reza Zadeh] many style changes f4002d7 [Reza Zadeh] more docs bd53c7a [Reza Zadeh] proper accumulator a8b5ecf [Reza Zadeh] Don't use for loops 0dc7980 [Reza Zadeh] filter zeros in sparse 6115610 [Reza Zadeh] More documentation 36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation bc4599f [Reza Zadeh] configurable rcond 86f7515 [Reza Zadeh] compute per parition, 
use while 09726b3 [Reza Zadeh] more style changes 4195e69 [Reza Zadeh] private, accumulator 17002be [Reza Zadeh] style changes 4ba7471 [Reza Zadeh] style change f4982e6 [Reza Zadeh] Use dense matrix in example 2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops 72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean f807be9 [Reza Zadeh] fix typo 2d7ccde [Reza Zadeh] Array interface for dense svd and pca cd290fa [Reza Zadeh] provide RDD[Array[Double]] support 398d123 [Reza Zadeh] style change 55abbfa [Reza Zadeh] docs fix ef29644 [Reza Zadeh] bad chnage undo 472566e [Reza Zadeh] all files from old pr 555168f [Reza Zadeh] initial files	2014-03-20 10:39:20 -07:00
Xiangrui Meng	f9d8a83c00	[SPARK-1266] persist factors in implicit ALS In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decreased by ~70% for implicit ALS on the movielens data. I also made the following changes: 1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation. 2. Mark methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods. 3. Materialize the final matrix factors before returning the model. It allows us to clean up other cached RDDs before returning the model. I do not have a better solution here, so I use `RDD.count()`. JIRA: https://spark-project.atlassian.net/browse/SPARK-1266 Author: Xiangrui Meng <meng@databricks.com> Closes #165 from mengxr/als and squashes the following commits: c9676a6 [Xiangrui Meng] add a comment about the last products.persist d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ... 63862d6 [Xiangrui Meng] persist factors in implicit ALS	2014-03-18 17:20:42 -07:00
Xiangrui Meng	e108b9ab94	[SPARK-1260]: faster construction of features with intercept The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here. Also, I don't see a reason to set initial weights to ones. So I set them to zeros. JIRA: https://spark-project.atlassian.net/browse/SPARK-1260 Author: Xiangrui Meng <meng@databricks.com> Closes #161 from mengxr/sgd and squashes the following commits: b5cfc53 [Xiangrui Meng] set default weights to zeros a1439c2 [Xiangrui Meng] faster construction of features with intercept	2014-03-18 15:14:13 -07:00
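Note on e108b9ab94: the two ways of building a feature array with an intercept that the message compares can be written side by side as below. This is only an illustrative sketch. The object and method names are hypothetical, and the performance difference is the commit's claim, not something this snippet measures.

```scala
// Hypothetical sketch; method names are illustrative only.
object InterceptSketch {
  /** The form the message calls inefficient: varargs Array.apply. */
  def withInterceptVarargs(features: Array[Double]): Array[Double] =
    Array(1.0, features: _*)

  /** The form the message prefers: prepend 1.0 with `+:`. */
  def withIntercept(features: Array[Double]): Array[Double] =
    1.0 +: features

  /** Zero initial weights; a freshly allocated Array[Double] is zero-filled. */
  def initialWeights(n: Int): Array[Double] = new Array[Double](n)
}
```

Both constructions produce the same array; only the way it is built differs.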
Xiangrui Meng	e4e8d8f395	[SPARK-1237, 1238] Improve the computation of YtY for implicit ALS Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data. To compare the results, I also added an option to set a random seed in ALS. JIRA: 1. https://spark-project.atlassian.net/browse/SPARK-1237 2. https://spark-project.atlassian.net/browse/SPARK-1238 Author: Xiangrui Meng <meng@databricks.com> Closes #131 from mengxr/als and squashes the following commits: ed00432 [Xiangrui Meng] minor changes d984623 [Xiangrui Meng] minor changes 2fc1641 [Xiangrui Meng] remove commented code 4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS 200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock	2014-03-13 00:43:19 -07:00
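Note on e4e8d8f395: the idea of accumulating YtY through rank-1 updates into a single packed matrix, instead of materializing each y_i y_i^T, can be sketched in plain Scala as below. This is a stand-in for what a BLAS DSPR call does and is not the code from the commit. The names `spr` and `computeYtY`, and the choice of upper-triangular, column-major packed storage, are assumptions of this sketch.

```scala
// Hypothetical pure-Scala stand-in for a DSPR-style update; not the commit's code.
object PackedGramSketch {
  /**
   * Rank-1 update A := A + alpha * x * x^T, where A is symmetric and stored as
   * its upper triangle packed column by column (length n * (n + 1) / 2).
   */
  def spr(alpha: Double, x: Array[Double], ap: Array[Double]): Unit = {
    val n = x.length
    var k = 0 // running position in the packed array
    var j = 0
    while (j < n) {
      var i = 0
      while (i <= j) {
        ap(k) += alpha * x(i) * x(j)
        k += 1
        i += 1
      }
      j += 1
    }
  }

  /** Accumulate Y^T Y over all factor vectors using a single packed buffer. */
  def computeYtY(ys: Array[Array[Double]], rank: Int): Array[Double] = {
    val ap = new Array[Double](rank * (rank + 1) / 2)
    ys.foreach(y => spr(1.0, y, ap))
    ap
  }
}
```

Only one buffer of k(k+1)/2 doubles is allocated per accumulation, which is the saving over creating a k-by-k matrix for every vector.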
CodingCat	9032f7c0d5	SPARK-1160: Deprecate toArray in RDD https://spark-project.atlassian.net/browse/SPARK-1160 reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python." In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly Author: CodingCat <zhunansjtu@gmail.com> Closes #105 from CodingCat/SPARK-1060 and squashes the following commits: 286f163 [CodingCat] deprecate in JavaRDDLike ee17b4e [CodingCat] add message and since 2ff7319 [CodingCat] deprecate toArray in RDD	2014-03-12 17:43:12 -07:00
Sandy Ryza	a99fb3747a	SPARK-1193. Fix indentation in pom.xmls Author: Sandy Ryza <sandy@cloudera.com> Closes #91 from sryza/sandy-spark-1193 and squashes the following commits: a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls	2014-03-07 23:10:35 -08:00
Patrick Wendell	c3f5e07533	SPARK-1121: Include avro for yarn-alpha builds This lets us explicitly include Avro based on a profile for 0.23.X builds. It makes me sad how convoluted it is to express this logic in Maven. @tgraves and @sryza curious if this works for you. I'm also considering just reverting to how it was before. The only real problem was that Spark advertised a dependency on Avro even though it only really depends transitively on Avro through other deps. Author: Patrick Wendell <pwendell@gmail.com> Closes #49 from pwendell/avro-build-fix and squashes the following commits: 8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile	2014-03-02 15:18:19 -08:00
Patrick Wendell	1fd2bfd3dd	Remove remaining references to incubation This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier. Author: Patrick Wendell <pwendell@gmail.com> Closes #51 from pwendell/tlp and squashes the following commits: d553b1b [Patrick Wendell] Remove remaining references to incubation	2014-03-02 01:00:16 -08:00
DB Tsai	6fc76e49c1	Initialized the regVal for first iteration in SGD optimizer Ported from https://github.com/apache/incubator-spark/pull/633 In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed. It may not be important here for SGD since the updater doesn't take the loss as parameter to find the new weights. But it will give us the correct history of loss. However, for LBFGS optimizer we implemented, the correct loss with regVal is crucial to find the new weights. Author: DB Tsai <dbtsai@alpinenow.com> Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits: 77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.	2014-03-02 00:31:59 -08:00
Sean Owen	c8a4c9b1f6	MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`: ``` factors.flatMapValues{ case factorArray => factorArray.map{ vector => val x = new DoubleMatrix(vector) x.mmul(x.transpose()) } }.reduceByKeyLocally((a, b) => a.addi(b)) .values .reduce((a, b) => a.addi(b)) ``` Completely correct, but there's a subtle but quite large memory problem here. map() is going to create all of these matrices in memory at once, when they don't need to ever all exist at the same time. For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap. Fortunately there's a trivial change that fixes it; just add `.view` in there. Author: Sean Owen <sowen@cloudera.com> Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits: 062cda9 [Sean Owen] Update style per review comments e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices	2014-02-21 12:46:12 -08:00
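Note on c8a4c9b1f6: the 32GB figure follows from 100000 matrices of 200 x 200 doubles at 8 bytes each, that is 100000 * 200 * 200 * 8 = 32,000,000,000 bytes. The effect of adding `.view` can be sketched without jblas as below. This is an illustration under assumptions: matrices are flat row-major arrays, and `outer`, `addInPlace`, `gram` and `eagerBytes` are hypothetical helpers, not names from `ALS.scala`.

```scala
// Hypothetical sketch using plain arrays in place of jblas DoubleMatrix.
object LazyGramSketch {
  /** Outer product x * x^T as a flat row-major array of length f * f. */
  def outer(x: Array[Double]): Array[Double] = {
    val f = x.length
    val m = new Array[Double](f * f)
    var i = 0
    while (i < f) {
      var j = 0
      while (j < f) { m(i * f + j) = x(i) * x(j); j += 1 }
      i += 1
    }
    m
  }

  /** In-place a += b, returning a (plays the role of addi in the quoted snippet). */
  def addInPlace(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }

  /**
   * Sum of outer products. With `.view`, map is lazy: each product is built
   * only when reduce asks for it, so the products never all coexist in memory.
   */
  def gram(rows: Array[Array[Double]]): Array[Double] =
    rows.view.map(x => outer(x)).reduce((a, b) => addInPlace(a, b))

  /** Bytes needed if all n matrices of f x f doubles were held at once. */
  def eagerBytes(n: Long, f: Long): Long = n * f * f * 8L
}
```

Without `.view`, `rows.map(...)` would first build an array holding every outer product, which is the allocation pattern the message describes.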
Sean Owen	9e63f80e75	MLLIB-22. Support negative implicit input in ALS I'm back with another less trivial suggestion for ALS: In ALS for implicit feedback, input values are treated as weights on squared-errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared-errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus. There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view or like or buy, but equally, can come from "not interested" signals. The natural representation is negative values. The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1. The implications for the algorithm are simple: the confidence function value must not be negative, and so can become 1 + alpha*|r|; and the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative. This in turn entails just a few lines of code change in `ALS.scala`: * `rs(i)` becomes `abs(rs(i))` * When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true for any us(i) that is iterated over, before, since these are exactly the ones for which P is 1. 
But now P is zero where rs(i) <= 0, and should not be added I think it's a safe change because: * It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked) * It's the simplest direct extension of the paper's algorithm * (I've used it to good effect in production FWIW) Tests included. I tweaked minor things en route: * `ALS.scala` javadoc writes "R = XtY" when the paper and rest of code defines it as "R = XYt" * RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually sum of weights Excuse my Scala style; I'm sure it needs tweaks. Author: Sean Owen <sowen@cloudera.com> Closes #500 from srowen/ALSNegativeImplicitInput and squashes the following commits: cf902a9 [Sean Owen] Support negative implicit input in ALS 953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt	2014-02-19 23:44:53 -08:00
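Note on 9e63f80e75: the two rules stated in that message (confidence uses the absolute value of the input, preference is 1 only for strictly positive input) can be sketched as below. The object and function names are hypothetical, and this is not the code from `ALS.scala`.

```scala
// Hypothetical sketch of the two rules; names are illustrative only.
object ImplicitFeedbackSketch {
  /** Confidence c = 1 + alpha * |r|; never negative, even for negative r. */
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * math.abs(r)

  /** Preference p: 1 where the input is strictly positive, otherwise 0. */
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0
}
```

For example, with an assumed alpha of 40, an input of -2 yields confidence 81 and preference 0, so the factorization is pushed toward 0 for that cell with the same strength that an input of +2 would push it toward 1.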
Chen Chao	f9b7d64a4e	MLLIB-24: url of "Collaborative Filtering for Implicit Feedback Datasets" in ALS is invalid now url of "Collaborative Filtering for Implicit Feedback Datasets" is invalid now. A new url is provided. http://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf Author: Chen Chao <crazyjvm@gmail.com> Closes #619 from CrazyJvm/master and squashes the following commits: a0b54e4 [Chen Chao] change url to IEEE 9e0e9f0 [Chen Chao] correct spell mistale fcfab5d [Chen Chao] wrap line to to fit within 100 chars 590d56e [Chen Chao] url error	2014-02-19 22:06:35 -08:00
Martin Jaggi	2182aa3c55	Merge pull request #566 from martinjaggi/copy-MLlib-d. new MLlib documentation for optimization, regression and classification new documentation with tex formulas, hopefully improving usability and reproducibility of the offered MLlib methods. also did some minor changes in the code for consistency. scala tests pass. this is the rebased branch, i deleted the old PR jira: https://spark-project.atlassian.net/browse/MLLIB-19 Author: Martin Jaggi <m.jaggi@gmail.com> Closes #566 and squashes the following commits: 5f0f31e [Martin Jaggi] line wrap at 100 chars 4e094fb [Martin Jaggi] better description of GradientDescent 1d6965d [Martin Jaggi] remove broken url ea569c3 [Martin Jaggi] telling what updater actually does 964732b [Martin Jaggi] lambda R() in documentation a6c6228 [Martin Jaggi] better comments in SGD code for regression b32224a [Martin Jaggi] new optimization documentation d5dfef7 [Martin Jaggi] new classification and regression documentation b07ead6 [Martin Jaggi] correct scaling for MSE loss ba6158c [Martin Jaggi] use d for the number of features bab2ed2 [Martin Jaggi] renaming LeastSquaresGradient	2014-02-09 15:19:50 -08:00
Patrick Wendell	b69f8b2a01	Merge pull request #557 from ScrapCodes/style. Closes #557 . SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Author: Patrick Wendell <pwendell@gmail.com> Author: Prashant Sharma <scrapcodes@gmail.com> == Merge branch commits == commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4 Author: Prashant Sharma <scrapcodes@gmail.com> Date: Sun Feb 9 17:39:07 2014 +0530 scala style fixes commit f91709887a8e0b608c5c2b282db19b8a44d53a43 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Jan 24 11:22:53 2014 -0800 Adding scalastyle snapshot	2014-02-09 10:09:19 -08:00
Mark Hamstra	c2341c92bb	Merge pull request #542 from markhamstra/versionBump. Closes #542 . Version number to 1.0.0-SNAPSHOT Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore. @pwendell Author: Mark Hamstra <markhamstra@gmail.com> == Merge branch commits == commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71 Author: Mark Hamstra <markhamstra@gmail.com> Date: Wed Feb 5 09:30:32 2014 -0800 Version number to 1.0.0-SNAPSHOT	2014-02-08 16:00:43 -08:00
Xiangrui Meng	23af00f9e0	Merge pull request #528 from mengxr/sample. Closes #528 . Refactor RDD sampling and add randomSplit to RDD (update) Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are: 1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513 2) Stratified sampling and importance sampling can be implemented in the same manner as well. Unit tests are included for samplers and RDD.randomSplit. This should perform better than my previous request where the BernoulliSampler creates many Iterator instances: https://github.com/apache/incubator-spark/pull/513 Author: Xiangrui Meng <meng@databricks.com> == Merge branch commits == commit e8ce957e5f0a600f2dec057924f4a2ca6adba373 Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 12:21:08 2014 -0800 more docs to PartitionwiseSampledRDD commit fbb4586d0478ff638b24bce95f75ff06f713d43b Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 00:44:23 2014 -0800 move XORShiftRandom to util.random and use it in BernoulliSampler commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 11:06:59 2014 -0800 relax assertions in SortingSuite because the RangePartitioner has large variance in this case commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:56:28 2014 -0800 test split ratio of RDD.randomSplit commit 8a410bc933a60c4d63852606f8bbc812e416d6ae Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:25:22 2014 -0800 add a test to ensure seed distribution and minor style update commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:06:22 2014 -0800 minor style change commit 750912b4d77596ed807d361347bd2b7e3b9b7a74 
Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:04:54 2014 -0800 fix some long lines commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:59:59 2014 -0800 add complement to BernoulliSampler and minor style changes commit dbe2bc2bd888a7bdccb127ee6595840274499403 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:45:08 2014 -0800 switch to partition-wise sampling for better performance commit a1fca5232308feb369339eac67864c787455bb23 Merge: `ac712e4` cf6128f Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 16:33:09 2014 -0800 Merge branch 'sample' of github.com:mengxr/incubator-spark into sample commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:40:07 2014 -0800 set SampledRDD deprecated in 1.0 commit f430f847c3df91a3894687c513f23f823f77c255 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:38:59 2014 -0800 update code style commit a8b5e2021a9204e318c80a44d00c5c495f1befb6 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:56:27 2014 -0800 move package random to util.random commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:50:35 2014 -0800 add Apache headers and update code style commit 985609fe1a55655ad11966e05a93c18c138a403d Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:49:25 2014 -0800 add new lines commit b21bddf29850a2c006a868869b8f91960a029322 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:46:35 2014 -0800 move samplers to random.IndependentRandomSampler and add tests commit c02dacb4a941618e434cefc129c002915db08be6 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Jan 25 15:20:24 2014 -0800 add RandomSampler commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 24 13:23:22 2014 -0800 init impl of IndependentlySampledRDD	
2014-02-03 13:02:09 -08:00
Sean Owen	f67ce3e229	Merge pull request #460 from srowen/RandomInitialALSVectors Choose initial user/item vectors uniformly on the unit sphere ...rather than within the unit square to possibly avoid bias in the initial state and improve convergence. The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets just a little large, the vectors tend strongly to point into the "corner", towards (1,1,1...,1). The vectors are not unit vectors either. I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This gets you uniform random choices on the unit sphere which is more what's of interest here. It has worked a little better for me in the past. This is pretty minor but wanted to warm up suggesting a few tweaks to ALS. Please excuse my Scala, pretty new to it. Author: Sean Owen <sowen@cloudera.com> == Merge branch commits == commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6 Author: Sean Owen <sowen@cloudera.com> Date: Mon Jan 27 08:05:25 2014 +0000 Style: spaces around binary operators commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e Author: Sean Owen <sowen@cloudera.com> Date: Sun Jan 19 22:50:03 2014 +0000 Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460 commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7 Author: Sean Owen <sowen@cloudera.com> Date: Sat Jan 18 15:54:42 2014 +0000 Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence	2014-01-27 11:15:51 -08:00
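Note on f67ce3e229: the initialization suggested in that message (draw each component from N(0,1), then normalize) can be sketched as below. The names are hypothetical and this is not the code merged into ALS. One of the branch commits mentions generating factors with all positive components; taking absolute values of the components would be one way to approximate that, but that variant is an assumption and is not shown here.

```scala
import scala.util.Random

// Hypothetical sketch; not the merged ALS initialization code.
object UnitSphereSketch {
  /** Draw n i.i.d. N(0,1) components and scale the vector to unit L2 norm. */
  def randomUnitVector(n: Int, rng: Random): Array[Double] = {
    val v = Array.fill(n)(rng.nextGaussian())
    // The norm is zero only if every draw is exactly 0.0, which this sketch
    // does not guard against.
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }
}
```

Because the Gaussian density depends only on the norm of the vector, the normalized result has no preferred direction, unlike components drawn from [0,1).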
Matei Zaharia	d009b17d13	Merge pull request #315 from rezazadeh/sparsesvd Sparse SVD # Singular Value Decomposition Given an m x n matrix A, compute matrices U, S, V such that A = U * S * V^T. There is no restriction on m, but we require n^2 doubles to fit in memory. Further, n should be less than m. The decomposition is computed by first computing A^TA = V S^2 V^T, computing svd locally on that (since n x n is small), from which we recover S and V. Then we compute U via easy matrix multiplication as U = A * V * S^-1. Only singular vectors associated with the largest k singular values are recovered. If there are k such values, then the dimensions of the return will be: * S is k x k and diagonal, holding the singular values on diagonal. * U is m x k and satisfies U^TU = eye(k). * V is n x k and satisfies V^TV = eye(k). All input and output is expected in sparse matrix format, 0-indexed as tuples of the form ((i,j),value) all in RDDs. # Testing Tests included. They test: - Decomposition promise (A = USV^T) - For small matrices, output is compared to that of jblas - Rank 1 matrix test included - Full Rank matrix test included - Middle-rank matrix forced via k included # Example Usage import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.SVD import org.apache.spark.mllib.linalg.SparseMatrix import org.apache.spark.mllib.linalg.MatrixEntry // Load and parse the data file val data = sc.textFile("mllib/data/als/test.data").map { line => val parts = line.split(',') MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble) } val m = 4 val n = 4 // recover top 1 singular vector val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1) println("singular values = " + decomposed.S.data.toArray.mkString) # Documentation Added to docs/mllib-guide.md	2014-01-22 14:01:30 -08:00
Andrew Tulloch	3a067b4a76	Fixed import order	2014-01-21 13:36:53 +00:00
Andrew Tulloch	720836a761	LocalSparkContext for MLlib	2014-01-19 17:51:00 +00:00
Sean Owen	e91ad3f164	Correct L2 regularized weight update with canonical form	2014-01-18 12:53:01 +00:00
Reza Zadeh	85b95d039d	rename to MatrixSVD	2014-01-17 14:40:51 -08:00
Reza Zadeh	fa3299835b	rename to MatrixSVD	2014-01-17 14:39:30 -08:00
Reza Zadeh	caf97a25a2	Merge remote-tracking branch 'upstream/master' into sparsesvd	2014-01-17 14:34:03 -08:00
Reza Zadeh	c9b4845bc1	prettify	2014-01-17 14:14:29 -08:00
Reza Zadeh	dbec69bbf4	add rename computeSVD	2014-01-17 13:59:05 -08:00
Reza Zadeh	eb2d8c431f	replace this.type with SVD	2014-01-17 13:57:27 -08:00
Reza Zadeh	cb13b15a60	use 0-indexing	2014-01-17 13:55:42 -08:00
Reynold Xin	84595ea3e2	Merge pull request #414 from soulmachine/code-style Code clean up for mllib * Removed unnecessary parentheses * Removed unused imports * Simplified `filter...size()` to `count ...` * Removed obsoleted parameters' comments	2014-01-15 20:15:29 -08:00
Frank Dai	57fcfc75b3	Added parentheses because getDouble() also has a side effect	2014-01-14 18:56:11 +08:00
Patrick Wendell	23034798d7	Add missing header files	2014-01-14 01:17:13 -08:00
Reza Zadeh	845e568fad	Merge remote-tracking branch 'upstream/master' into sparsesvd	2014-01-13 23:52:34 -08:00
Frank Dai	a3da468d8b	Merge remote-tracking branch 'upstream/master' into code-style	2014-01-14 15:29:17 +08:00
Patrick Wendell	fdaabdc673	Merge pull request #380 from mateiz/py-bayes Add Naive Bayes to Python MLlib, and some API fixes - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)	2014-01-13 23:08:26 -08:00
Frank Dai	c2852cf42e	Indent two spaces	2014-01-14 14:59:01 +08:00
Frank Dai	12386b3eea	Since getLong() and getInt() have side effects, restore parentheses, and remove an empty line	2014-01-14 14:53:10 +08:00
Frank Dai	0d94d74edf	Code clean up for mllib	2014-01-14 14:37:26 +08:00
Henry Saputra	91a563608e	Merge branch 'master' into remove_simpleredundantreturn_scala	2014-01-12 10:34:13 -08:00
Henry Saputra	93a65e5fde	Remove simple redundant return statement for Scala methods/functions: -) Only change simple return statements at the end of method -) Ignore the complex if-else check -) Ignore the ones inside synchronized	2014-01-12 10:30:04 -08:00
Matei Zaharia	f00e949f84	Added Java unit test, data, and main method for Naive Bayes Also fixes mains of a few other algorithms to print the final model	2014-01-11 22:30:48 -08:00
Matei Zaharia	9a0dfdf868	Add Naive Bayes to Python MLlib, and some API fixes - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)	2014-01-11 22:30:48 -08:00
jerryshao	cbfbc01938	Fix small problem in ALS where configure didn't work	2014-01-11 16:22:45 +08:00
Reza Zadeh	21c8a54c08	Merge remote-tracking branch 'upstream/master' into sparsesvd Conflicts: docs/mllib-guide.md	2014-01-09 22:45:32 -08:00
Reza Zadeh	7d7490b67b	More sparse matrix usage.	2014-01-07 17:16:17 -08:00
Hossein Falaki	3a8beb46cb	Merge branch 'master' into MatrixFactorizationModel-fix	2014-01-07 15:22:42 -08:00
Hossein Falaki	04132ea9b2	Added Rating deserializer	2014-01-06 12:19:08 -08:00
Hossein Falaki	11a93fb5a8	Added serializing method for Rating object	2014-01-06 12:18:03 -08:00
Xusen Yin	05e6d5b454	Added GradientDescentSuite	2014-01-06 16:54:00 +08:00
Xusen Yin	a72107284a	fix logistic loss bug	2014-01-06 12:30:17 +08:00
Reynold Xin	d43ad3ef2c	Merge pull request #292 from soulmachine/naive-bayes standard Naive Bayes classifier Has implemented the standard Naive Bayes classifier. This is an updated version of #288, which is closed because of misoperations.	2014-01-04 16:29:30 -08:00
Hossein Falaki	8d0c2f7399	Added python binding for bulk recommendation	2014-01-04 16:23:17 -08:00
Reza Zadeh	06c0f7628a	use SparseMatrix everywhere	2014-01-04 14:28:07 -08:00
Reza Zadeh	cdff9fc858	prettify	2014-01-04 12:44:04 -08:00
Reza Zadeh	e9bd6cb51d	new example file	2014-01-04 12:33:22 -08:00
Reza Zadeh	8bfcce1ad8	fix tests	2014-01-04 11:52:42 -08:00
Reza Zadeh	35adc72794	set methods	2014-01-04 11:30:36 -08:00
Reza Zadeh	73daa700bd	add k parameter	2014-01-04 01:52:28 -08:00
Reza Zadeh	26a74f0c41	using decomposed matrix struct now	2014-01-04 00:38:53 -08:00
Reza Zadeh	d2d5e5e062	new return struct	2014-01-04 00:15:04 -08:00
Reza Zadeh	7f631dd2a9	start using matrixentry	2014-01-03 22:17:24 -08:00
Reza Zadeh	6bcdb762a1	rename sparsesvd.scala	2014-01-03 21:55:38 -08:00
Reza Zadeh	b059a2a00c	New matrix entry file	2014-01-03 21:54:57 -08:00
Hossein Falaki	dfe57fa84c	Removed unnecessary blank line	2014-01-03 15:40:53 -08:00
Hossein Falaki	2c1cba851c	Added unit tests for bulk prediction in MatrixFactorizationModel	2014-01-03 15:35:20 -08:00
Hossein Falaki	67f937ec22	Added a method to enable bulk prediction	2014-01-03 15:34:16 -08:00
Reza Zadeh	e617ae2dad	fix error message	2014-01-02 01:51:38 -08:00
Reza Zadeh	61405785bc	Merge remote-tracking branch 'upstream/master' into sparsesvd	2014-01-02 01:50:30 -08:00
Reza Zadeh	2612164f85	more docs yay	2014-01-01 20:22:29 -08:00
Reza Zadeh	915d53f8ac	javadoc for sparsesvd	2014-01-01 20:20:16 -08:00
Reza Zadeh	185c882606	tweaks to docs	2014-01-01 19:53:14 -08:00
Lian, Cheng	dd6033e685	Aggregated all sample points to driver without any shuffle	2014-01-02 01:38:24 +08:00
Lian, Cheng	6d0e2e86df	Response to comments from Reynold, Ameet and Evan * Arguments renamed according to Ameet's suggestion * Using DoubleMatrix instead of Array[Double] in computation * Removed arguments C (kinds of label) and D (dimension of feature vector) from NaiveBayes.train() * Replaced reduceByKey with foldByKey to avoid modifying original input data	2013-12-30 22:46:32 +08:00
Matei Zaharia	b4ceed40d6	Merge remote-tracking branch 'origin/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala	2013-12-29 15:08:08 -05:00
Lian, Cheng	f150b6e76c	Response to Reynold's comments	2013-12-29 17:13:01 +08:00
Matei Zaharia	642029e7f4	Various fixes to configuration code - Got rid of global SparkContext.globalConf - Pass SparkConf to serializers and compression codecs - Made SparkConf public instead of private[spark] - Improved API of SparkContext and SparkConf - Switched executor environment vars to be passed through SparkConf - Fixed some places that were still using system properties - Fixed some tests, though others are still failing This still fails several tests in core, repl and streaming, likely due to properties not being set or cleared correctly (some of the tests run fine in isolation).	2013-12-28 17:13:15 -05:00
Reza Zadeh	ae5102acc0	large scale considerations	2013-12-27 04:15:13 -05:00
Reza Zadeh	642ab5c1e1	initial large scale testing begin	2013-12-27 01:51:19 -05:00
Reza Zadeh	3369c2d487	cleanup documentation	2013-12-27 00:41:46 -05:00
Reza Zadeh	bdb5037987	add all tests	2013-12-27 00:36:41 -05:00
Reza Zadeh	fa1e8d8cbf	test for truncated svd	2013-12-27 00:34:59 -05:00
Reza Zadeh	16de5268e3	full rank matrix test added	2013-12-26 23:21:57 -05:00
Lian, Cheng	d7086dc28a	Added Apache license header to NaiveBayesSuite	2013-12-27 08:20:41 +08:00
Reza Zadeh	fe1a132d40	Main method added for svd	2013-12-26 18:13:21 -05:00
Reza Zadeh	1a21ba2967	new main file	2013-12-26 18:09:33 -05:00
Reza Zadeh	6c3674cd23	Object to hold the svd methods	2013-12-26 17:39:25 -05:00
Reza Zadeh	6e740cc901	Some documentation	2013-12-26 16:12:40 -05:00
Lian, Cheng	654f42174a	Reformatted some lines commented by Matei	2013-12-27 04:45:04 +08:00
Reza Zadeh	1a173f00bd	Initial files - no tests	2013-12-26 15:01:03 -05:00
Lian, Cheng	c0337c5bbf	Let reduceByKey take care of local combine Also refactored some heavy FP code to improve readability and reduce memory footprint.	2013-12-25 22:45:57 +08:00
Lian, Cheng	3bb714eaa3	Refactored NaiveBayes * Minimized shuffle output with mapPartitions. * Reduced RDD actions from 3 to 1.	2013-12-25 17:15:38 +08:00
Frank Dai	3dc655aa19	standard Naive Bayes classifier	2013-12-25 16:50:42 +08:00
Tor Myklebust	4e821390bc	Scala stubs for updated Python bindings.	2013-12-25 00:09:00 -05:00
Tor Myklebust	58e2a7d6d4	Move PythonMLLibAPI into its own package.	2013-12-24 16:48:40 -05:00
Tor Myklebust	2402180b32	Fix error message ugliness.	2013-12-24 16:18:33 -05:00
Prashant Sharma	2573add94c	spark-544, introducing SparkConf and related configuration overhaul.	2013-12-25 00:09:36 +05:30
Tor Myklebust	20f85eca3d	Java stubs for ALSModel.	2013-12-21 14:54:13 -05:00
Tor Myklebust	b454fdc2eb	Javadocs; also, declare some things private.	2013-12-20 02:10:21 -05:00
Tor Myklebust	b835ddf3df	Licence notice.	2013-12-20 01:55:03 -05:00
Tor Myklebust	f99970e8cd	Scala classification and clustering stubs; matrix serialization/deserialization.	2013-12-20 00:12:22 -05:00
Tor Myklebust	ded67ee90c	Bindings for linear, Lasso, and ridge regression.	2013-12-19 22:42:12 -05:00
Tor Myklebust	2a41c9aad3	Un-semicolon PythonMLLibAPI.	2013-12-19 21:27:11 -05:00
Tor Myklebust	95915f8b3b	First cut at python mllib bindings. Only LinearRegression is supported.	2013-12-19 01:29:09 -05:00
Mark Hamstra	09ed7ddfa0	Use scala.binary.version in POMs	2013-12-15 12:39:58 -08:00
Prashant Sharma	17db6a9041	Style fixes and addressed review comments at #221	2013-12-10 11:47:16 +05:30
Prashant Sharma	7ad6921ae0	Incorporated Patrick's feedback comment on #211 and made maven build/dep-resolution at least a bit faster.	2013-12-07 12:45:57 +05:30
Prashant Sharma	44fd30d3fb	Merge branch 'master' into scala-2.10-wip Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala project/SparkBuild.scala	2013-11-25 18:10:54 +05:30
Marek Kolodziej	22724659db	Make XORShiftRandom explicit in KMeans and roll it back for RDD	2013-11-20 07:03:36 -05:00
Marek Kolodziej	99cfe89c68	Updates to reflect pull request code review	2013-11-18 22:00:36 -05:00
Marek Kolodziej	09bdfe3b16	XORShift RNG with unit tests and benchmark To run unit test, start SBT console and type: compile test-only org.apache.spark.util.XORShiftRandomSuite To run benchmark, type: project core console Once the Scala console starts, type: org.apache.spark.util.XORShiftRandom.benchmark(100000000)	2013-11-18 15:21:43 -05:00
Prashant Sharma	026ab75661	Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10	2013-10-10 09:42:55 +05:30
Prashant Sharma	26860639c5	Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala project/SparkBuild.scala	2013-10-10 09:42:23 +05:30
Prashant Sharma	7be75682b9	Merge branch 'master' into wip-merge-master Conflicts: bagel/pom.xml core/pom.xml core/src/test/scala/org/apache/spark/ui/UISuite.scala examples/pom.xml mllib/pom.xml pom.xml project/SparkBuild.scala repl/pom.xml streaming/pom.xml tools/pom.xml In scala 2.10, a shorter representation is used for naming artifacts so changed to shorter scala version for artifacts and made it a property in pom.	2013-10-08 11:29:40 +05:30
Nick Pentreath	a5e58b8f98	Merge branch 'master' into implicit-als	2013-10-07 11:46:17 +02:00
Nick Pentreath	b0f5f4d441	Bumping up test matrix size to eliminate random failures	2013-10-07 11:44:22 +02:00
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Martin Weindel	e09f4a9601	fixed some warnings	2013-10-05 23:08:23 +02:00
Nick Pentreath	c6ceaeae50	Style fix using 'if' rather than 'match' on boolean	2013-10-04 13:52:53 +02:00
Nick Pentreath	6a7836cddc	Fixing closing brace indentation	2013-10-04 13:33:01 +02:00
Nick Pentreath	0bd9b373d1	Reverting to using comma-delimited split	2013-10-04 13:30:33 +02:00
Nick Pentreath	1cbdcb9cb6	Merge remote-tracking branch 'upstream/master' into implicit-als	2013-10-04 13:25:34 +02:00
Prashant Sharma	5829692885	Merge branch 'master' into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala docs/_config.yml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2013-10-01 11:57:24 +05:30
Prashant Sharma	7ff4c2d399	fixed maven build for scala 2.10	2013-09-26 10:48:24 +05:30
Patrick Wendell	6079721fa1	Update build version in master	2013-09-24 11:41:51 -07:00
Nick Pentreath	d952f04c8e	Merge remote-tracking branch 'upstream/master' into implicit-als	2013-09-23 13:07:40 +02:00
Prashant Sharma	383e151fd7	Merge branch 'master' of git://github.com/mesos/spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala project/SparkBuild.scala	2013-09-15 10:55:12 +05:30
Matei Zaharia	7a5c4b647b	Small tweaks to MLlib docs	2013-09-08 21:47:24 -07:00
Ameet Talwalkar	81a8bd46ac	response to PR comments	2013-09-08 19:21:30 -07:00
Nick Pentreath	737f01a1ef	Adding algorithm for implicit feedback data to ALS	2013-09-06 14:45:05 +02:00
Prashant Sharma	4106ae9fbf	Merged with master	2013-09-06 17:53:01 +05:30
Matei Zaharia	12b2f1f9c9	Add missing license headers found with RAT	2013-09-02 12:23:03 -07:00
Matei Zaharia	0a8cc30921	Move some classes to more appropriate packages: * RDD, RDDFunctions -> org.apache.spark.rdd * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util * JavaSerializer, KryoSerializer -> org.apache.spark.serializer	2013-09-01 14:13:16 -07:00
Matei Zaharia	5701eb92c7	Fix some URLs	2013-09-01 14:13:16 -07:00
Matei Zaharia	46eecd110a	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
Shivaram Venkataraman	adc700582b	Fix broken build by removing addIntercept	2013-08-30 00:16:32 -07:00
Evan Sparks	016787de32	Merge pull request #863 from shivaram/etrain-ridge Adding linear regression and refactoring Ridge regression to use SGD	2013-08-29 22:15:14 -07:00
Evan Sparks	852d810787	Merge pull request #819 from shivaram/sgd-cleanup Change SVM to use {0,1} labels	2013-08-29 22:13:15 -07:00
Shivaram Venkataraman	dc06b52879	Add an option to turn off data validation, test it. Also moves addIntercept to have default true to make it similar to validateData option	2013-08-25 23:14:35 -07:00
Shivaram Venkataraman	b8c50a0642	Center & scale variables in Ridge, Lasso. Also add a unit test that checks if ridge regression lowers cross-validation error.	2013-08-25 22:24:27 -07:00
Matei Zaharia	215c13dd41	Fix code style and a nondeterministic RDD issue in ALS	2013-08-22 16:13:46 -07:00
Matei Zaharia	46ea0c1b47	Merge pull request #814 from holdenk/master Create less instances of the random class during ALS initialization.	2013-08-22 15:57:28 -07:00
Jey Kottalam	23f4622aff	Remove redundant dependencies from POMs	2013-08-18 18:53:57 -07:00
Evan Sparks	07fe910669	Fixing typos in Java tests, and addressing alignment issues.	2013-08-18 15:03:13 -07:00
Evan Sparks	b291db712e	Centralizing linear data generator and mllib regression tests to use it.	2013-08-18 15:03:13 -07:00
Evan Sparks	b659af83d3	Adding Linear Regression, and refactoring Ridge Regression.	2013-08-18 15:03:13 -07:00
Jey Kottalam	ad580b94d5	Maven build now also works with YARN	2013-08-16 13:50:12 -07:00
Jey Kottalam	9dd15fe700	Don't mark hadoop-client as 'provided'	2013-08-16 13:50:12 -07:00
Jey Kottalam	11b42a84db	Maven build now works with CDH hadoop-2.0.0-mr1	2013-08-16 13:50:12 -07:00
Jey Kottalam	353fab2440	Initial changes to make Maven build agnostic of hadoop version	2013-08-16 13:50:12 -07:00
Holden Karau	8fc40818d7	Fix	2013-08-15 23:08:48 -07:00
Shivaram Venkataraman	c874625354	Specify label format in LogisticRegression.	2013-08-13 16:55:53 -07:00
Shivaram Venkataraman	0ab6ff4c32	Fix SVM model and unit test to work with {0,1}. Also rename validateFuncs to validators.	2013-08-13 13:57:06 -07:00
Shivaram Venkataraman	654087194d	Change SVM to use {0,1} labels. Also add a data validation check to make sure classification labels are always 0 or 1 and add an appropriate test case.	2013-08-13 11:44:47 -07:00
Holden Karau	d145da818e	Code review feedback :)	2013-08-12 22:13:08 -07:00
Holden Karau	705c9ace2a	Use less instances of the random class during ALS setup	2013-08-12 22:08:36 -07:00
Matei Zaharia	9e02da2763	Merge pull request #812 from shivaram/maven-mllib-tests Create SparkContext in beforeAll for MLLib tests	2013-08-12 20:22:27 -07:00
Shivaram Venkataraman	4935a2558b	Clean up scaladoc in ML Lib. Also build and copy ML Lib scaladoc in Spark docs build. Some more minor cleanup with respect to naming, test locations etc.	2013-08-11 19:02:43 -07:00
Shivaram Venkataraman	ecc9bfe377	Create SparkContext in beforeAll for MLLib tests This overcomes test failures that occur using Maven	2013-08-11 17:04:00 -07:00
Evan Sparks	ff9ebfabb4	Merge pull request #762 from shivaram/sgd-cleanup Refactor SGD options into a new class.	2013-08-11 10:52:55 -07:00
Shivaram Venkataraman	a65a6ed514	Fix GLM code review comments and move java tests	2013-08-10 18:54:10 -07:00
Matei Zaharia	cd247ba5bb	Merge pull request #786 from shivaram/mllib-java Java fixes, tests and examples for ALS, KMeans	2013-08-09 20:41:13 -07:00
Reynold Xin	01f20a941e	Fixed a typo in mllib inline documentation.	2013-08-08 16:42:54 -07:00
Shivaram Venkataraman	2812e72200	Add setters for optimizer, gradient in SGD. Also remove java-specific constructor for LabeledPoint.	2013-08-08 16:24:31 -07:00
Shivaram Venkataraman	e1a209f791	Remove Java-specific constructor for Rating. The scala constructor works for native type java types. Modify examples to match this.	2013-08-08 14:36:02 -07:00
Shivaram Venkataraman	338b7a7455	Merge branch 'master' of git://github.com/mesos/spark into sgd-cleanup Conflicts: mllib/src/main/scala/spark/mllib/util/MLUtils.scala	2013-08-06 21:21:55 -07:00
Shivaram Venkataraman	7db69d56f2	Refactor GLM algorithms and add Java tests This change adds Java examples and unit tests for all GLM algorithms to make sure the MLLib interface works from Java. Changes include - Introduce LabeledPoint and avoid using Doubles in train arguments - Rename train to run in class methods - Make the optimizer a member variable of GLM to make sure the builder pattern works	2013-08-06 17:23:22 -07:00
Shivaram Venkataraman	6caec3f441	Add a test case for random initialization. Also work around a bug where double[][] class cast fails	2013-08-06 16:35:47 -07:00
Shivaram Venkataraman	471fbadd0c	Java examples, tests for KMeans and ALS - Changes ALS to accept RDD[Rating] instead of (Int, Int, Double) making it easier to call from Java - Renames class methods from `train` to `run` to enable static methods to be called from Java. - Add unit tests which check if both static / class methods can be called. - Also add examples which port the main() function in ALS, KMeans to the examples project. Couple of minor changes to existing code: - Add a toJavaRDD method in RDD to convert scala RDD to java RDD easily - Work around a bug where using double[] from Java leads to a class cast exception in KMeans init	2013-08-06 15:43:46 -07:00
Ginger Smith	bf7033f3eb	fixing formatting, style, and input	2013-08-05 21:26:24 -07:00
Ginger Smith	8c8947e2b6	fixing formatting	2013-08-05 11:22:18 -07:00
Shivaram Venkataraman	7388e27668	Move implicit arg to constructor for Java access.	2013-08-03 18:08:43 -07:00
Ginger Smith	4ab4df5edb	adding matrix factorization data generator	2013-08-02 22:22:36 -07:00
Shivaram Venkataraman	00339cc032	Refactor optimizers and create GLMs This change refactors the structure of GLMs to use mixins which maintain a similar interface to other ML lib algorithms. This change also creates an Optimizer trait which allows GLMs to be extended to use other optimization techniques.	2013-08-02 19:15:34 -07:00
Matei Zaharia	abfa9e6f70	Increase Kryo buffer size in ALS since some arrays become big	2013-08-02 16:17:32 -07:00
shivaram	58756b72f1	Merge pull request #761 from mateiz/kmeans-generator Add data generator for K-means	2013-07-31 23:45:41 -07:00
Matei Zaharia	52dba89261	Turn on caching in KMeans.main	2013-07-31 23:08:12 -07:00
Matei Zaharia	b2b86c2575	Merge pull request #753 from shivaram/glm-refactor Build changes for ML lib	2013-07-31 15:51:39 -07:00
Matei Zaharia	f607ffb9e1	Added data generator for K-means Also made it possible to specify the number of runs in KMeans.main().	2013-07-31 14:31:07 -07:00
Shivaram Venkataraman	cef178873b	Refactor SGD options into a new class. This refactoring pulls out code shared between SVM, Lasso, LR into a common GradientDescentOpts class. Some style cleanup as well	2013-07-31 14:15:17 -07:00
Matei Zaharia	9a444cffe7	Use the Char version of split() instead of the String one for efficiency	2013-07-31 11:28:39 -07:00
Shivaram Venkataraman	48851d4dd9	Add bagel, mllib to SBT assembly. Also add jblas dependency to mllib pom.xml	2013-07-30 14:03:15 -07:00
Reynold Xin	366f7735eb	Minor style cleanup of mllib.	2013-07-30 13:59:32 -07:00
Reynold Xin	47011e6854	Use a tighter bound in logistic regression unit test's prediction validation.	2013-07-30 13:58:23 -07:00
Reynold Xin	e35966ae9a	Renamed Classification.scala to ClassificationModel.scala and Regression.scala to RegressionModel.scala	2013-07-30 13:28:31 -07:00
Ameet Talwalkar	e4387ddf5d	made SimpleUpdater consistent with other updaters	2013-07-29 22:21:50 -07:00
Shivaram Venkataraman	3ca9faa341	Clarify how regVal is computed in Updater docs	2013-07-29 18:37:28 -07:00
Shivaram Venkataraman	07da72b451	Remove duplicate loss history and clarify why. Also some minor style fixes.	2013-07-29 16:25:17 -07:00
Xinghao	2b2630ba3c	Style fix Lines shortened to < 100 characters	2013-07-29 09:22:49 -07:00
Xinghao	07f17439a5	Fix validatePrediction functions for Classification models Classifiers return categorical (Int) values that should be compared directly	2013-07-29 09:22:31 -07:00
Xinghao	3a8d07df8c	Deleting extra LogisticRegressionGenerator and RidgeRegressionGenerator	2013-07-29 09:20:26 -07:00
Xinghao	75f3757300	Fix rounding error in LogisticRegression.scala	2013-07-29 09:19:56 -07:00
Xinghao	c823ee1e2b	Replace map-reduce with dot operator using DoubleMatrix	2013-07-28 22:17:53 -07:00
Xinghao	96e04f4cb7	Fixed SVM and LR train functions to take Int instead of Double for Classification	2013-07-28 22:12:39 -07:00
Xinghao	9398dced03	Changed Classification to return Int instead of Double Also minor changes to formatting and comments	2013-07-28 21:39:19 -07:00
Xinghao	67de051bbb	SVMSuite and LassoSuite rewritten to follow closely with LogisticRegressionSuite	2013-07-28 21:09:56 -07:00
Xinghao	29e042940a	Move data generators to util	2013-07-28 20:39:52 -07:00
Xinghao	ccfa362dde	Change _LocalRandomSGD to LocalRandomSGD	2013-07-28 10:33:57 -07:00
Xinghao	b0bbc7f6a8	Resolve conflicts with master, removed regParam for LogisticRegression	2013-07-26 18:57:39 -07:00
Xinghao	071afe2a33	New files from merge with master	2013-07-26 18:21:20 -07:00
Xinghao	10fd3949e6	Making ClassificationModel serializable	2013-07-26 17:49:11 -07:00
Xinghao	f0a1f95228	Rename LogisticRegression, SVM and Lasso to *_LocalRandomSGD	2013-07-26 17:36:14 -07:00
Xinghao	f74a03c6d8	Multiple changes - Changed LogisticRegression regularization parameter to 0 - Removed println from SVM predict function - Fixed "Lasso" -> "SVM" in SVMGenerator - Added comment in Updater.scala to indicate L1 regularization leads to soft thresholding proximal function	2013-07-26 17:29:44 -07:00
Xinghao	eef678703e	Adding SVM and Lasso, moving LogisticRegression to classification from regression Also, add regularization parameter to SGD	2013-07-24 15:32:50 -07:00
Reynold Xin	2210e8ccf8	Use a different validation dataset for Logistic Regression prediction testing.	2013-07-23 12:52:15 -07:00
Reynold Xin	87a9dd898f	Made RegressionModel serializable and added unit tests to make sure predict methods would work.	2013-07-23 12:13:27 -07:00
Matei Zaharia	c40f0f21f1	Merge pull request #711 from shivaram/ml-generators Move ML lib data generator files to util/	2013-07-19 13:33:04 -07:00
Shivaram Venkataraman	2c9ea56db4	Rename classes to be called DataGenerator	2013-07-18 11:57:14 -07:00
Shivaram Venkataraman	7ab1170503	Refactor data generators to have a function that can be used in tests.	2013-07-18 11:55:19 -07:00
Shivaram Venkataraman	217667174e	Return Array[Double] from SGD instead of DoubleMatrix	2013-07-17 16:08:34 -07:00
Shivaram Venkataraman	45f3c85518	Change weights to be Array[Double] in LR model. Also ensure weights are initialized to a column vector.	2013-07-17 16:03:29 -07:00
Shivaram Venkataraman	3bf9897136	Rename loss -> stochasticLoss and add a note to explain why we have multiple train methods.	2013-07-17 14:20:24 -07:00
Shivaram Venkataraman	64b88e039a	Move ML lib data generator files to util/	2013-07-17 14:11:44 -07:00
Shivaram Venkataraman	84fa20c2a1	Allow initial weight vectors in LogisticRegression. Also move LogisticGradient to the LogisticRegression file and fix the unit tests log path.	2013-07-17 14:04:05 -07:00
Matei Zaharia	af3c9d5042	Add Apache license headers and LICENSE and NOTICE files	2013-07-16 17:21:33 -07:00
Matei Zaharia	4698a0d688	Shuffle ratings in a more efficient way at start of ALS	2013-07-15 02:54:11 +00:00
Matei Zaharia	ed7fd501cf	Make number of blocks in ALS configurable and lower the default	2013-07-15 00:30:10 +00:00
Matei Zaharia	931e4c96ef	Fix a comment	2013-07-14 08:03:13 +00:00
Matei Zaharia	c5c38d1987	Some optimizations to loading phase of ALS	2013-07-14 07:59:50 +00:00
Ameet Talwalkar	bf4c9a5e0f	renamed with labeled prefix	2013-07-08 14:37:42 -07:00
ryanlecompte	be123aa6ef	update to use ListBuffer, faster than Vector for append operations	2013-07-07 15:35:06 -07:00
ryanlecompte	f78f8d0b41	fix formatting and use Vector instead of List to maintain order	2013-07-06 16:46:53 -07:00
ryanlecompte	757e56dfc7	make binSearch a tail-recursive method	2013-07-05 19:54:28 -07:00
Matei Zaharia	8bbe907556	Replaced string constants in test	2013-07-05 17:25:23 -07:00
Matei Zaharia	653043beb6	Renamed files to match package	2013-07-05 17:18:55 -07:00
Matei Zaharia	de67deeaab	Addressed style comments from Ryan LeCompte	2013-07-05 17:16:49 -07:00
Matei Zaharia	43b24635ee	Renamed ML package to MLlib and added it to classpath	2013-07-05 11:38:53 -07:00

680 commits