ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Reynold Xin	5ad78f6205	[SQL] Various DataFrame DSL update. 1. Added foreach, foreachPartition, flatMap to DataFrame. 2. Added col() in dsl. 3. Support renaming columns in toDataFrame. 4. Support type inference on arrays (in addition to Seq). 5. Updated mllib to use the new DSL. Author: Reynold Xin <rxin@databricks.com> Closes #4260 from rxin/sql-dsl-update and squashes the following commits: 73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve. fab3ccc [Reynold Xin] Bug fix. d31fcd2 [Reynold Xin] Style fix. 62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.	2015-01-29 00:01:10 -08:00
Burak Yavuz	a63be1a18f	[SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices The conversion methods for `BlockMatrix`. Conversions go through `CoordinateMatrix` in order to cause a shuffle so that intermediate operations will be stored on disk and the expensive initial computation will be mitigated. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4256 from brkyvz/SPARK-3977PR and squashes the following commits: 4df37fe [Burak Yavuz] moved TODO inside code block b049c07 [Burak Yavuz] addressed code review feedback v1 66cb755 [Burak Yavuz] added default toBlockMatrix conversion 851f2a2 [Burak Yavuz] added better comments and checks cdb9895 [Burak Yavuz] [SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices	2015-01-28 23:42:07 -08:00
Reynold Xin	5b9760de8d	[SPARK-5445][SQL] Made DataFrame dsl usable in Java Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago. Author: Reynold Xin <rxin@databricks.com> Closes #4241 from rxin/df-docupdate and squashes the following commits: c0f4810 [Reynold Xin] Fix Python merge conflict. 094c7d7 [Reynold Xin] Minor style fix. Reset Python tests. 3c89f4a [Reynold Xin] Package. dfe6962 [Reynold Xin] Updated Python aggregate. 5dd4265 [Reynold Xin] Made dsl Java callable. 14b3c27 [Reynold Xin] Fix literal expression for symbols. 68b31cb [Reynold Xin] Literal. 4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.	2015-01-28 19:10:32 -08:00
Xiangrui Meng	4ee79c71af	[SPARK-5430] move treeReduce and treeAggregate from mllib to core We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell Author: Xiangrui Meng <meng@databricks.com> Closes #4228 from mengxr/SPARK-5430 and squashes the following commits: 20ad40d [Xiangrui Meng] exclude tree* from mima e89a43e [Xiangrui Meng] fix compile and update java doc 3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python 6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core	2015-01-28 17:26:03 -08:00
Xiangrui Meng	e80dc1c5a8	[SPARK-4586][MLLIB] Python API for ML pipeline and parameters This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code. TODO: - [x] handle parameters in LRModel - [x] unit tests - [x] missing some docs CC: davies jkbradley Author: Xiangrui Meng <meng@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4151 from mengxr/SPARK-4586 and squashes the following commits: 415268e [Xiangrui Meng] remove inherit_doc from __init__ edbd6fe [Xiangrui Meng] move Identifiable to ml.util 44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 14ae7e2 [Davies Liu] fix docs 54ca7df [Davies Liu] fix tests 78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 1dca16a [Davies Liu] refactor 090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml 0882513 [Xiangrui Meng] update doc style a4f4dbf [Xiangrui Meng] add unit test for LR 7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer ba0ba1e [Xiangrui Meng] add unit tests for pipeline 0586c7b [Xiangrui Meng] add more comments to the example 5153cff [Xiangrui Meng] simplify java models 036ca04 [Xiangrui Meng] gen numFeatures 46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly 1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc f66ba0c [Xiangrui Meng] make params a property d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example 05e3e40 [Xiangrui Meng] update example d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl 56de571 [Xiangrui Meng] fix style d0c5bb8 [Xiangrui Meng] a working copy bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 17ecfb9 [Xiangrui Meng] code gen for shared params d9ea77c [Xiangrui Meng] update doc c18dca1 [Xiangrui Meng] make the example working dadd84e [Xiangrui Meng] add base classes and docs a3015cf [Xiangrui Meng] add Estimator and Transformer 46eea43 [Xiangrui Meng] a pipeline in python 33b68e0 [Xiangrui Meng] a working LR	2015-01-28 17:14:23 -08:00
Reynold Xin	c8e934ef3c	[SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame. and [SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext Author: Reynold Xin <rxin@databricks.com> Closes #4242 from rxin/sqlCleanup and squashes the following commits: e351cb2 [Reynold Xin] Fixed toDataFrame. 6545c42 [Reynold Xin] More changes. 728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.	2015-01-28 12:10:01 -08:00
Burak Yavuz	eeb53bf90e	[SPARK-3974][MLlib] Distributed Block Matrix Abstractions This pull request includes the abstractions for the distributed BlockMatrix representation. `BlockMatrix` will allow users to store very large matrices in small blocks of local matrices. Specific partitioners, such as `RowBasedPartitioner` and `ColumnBasedPartitioner`, are implemented in order to optimize addition and multiplication operations that will be added in a following PR. This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA. https://github.com/amplab/ml-matrix Additional thanks to rezazadeh, shivaram, and mengxr for guidance on the design. Author: Burak Yavuz <brkyvz@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Burak Yavuz <brkyvz@dn51t42l.sunet> Author: Burak Yavuz <brkyvz@dn51t4rd.sunet> Author: Burak Yavuz <brkyvz@dn0a221430.sunet> Closes #3200 from brkyvz/SPARK-3974 and squashes the following commits: a8eace2 [Burak Yavuz] Merge pull request #2 from mengxr/brkyvz-SPARK-3974 feb32a7 [Xiangrui Meng] update tests e1d3ee8 [Xiangrui Meng] minor updates 24ec7b8 [Xiangrui Meng] update grid partitioner 5eecd48 [Burak Yavuz] fixed gridPartitioner and added tests 140f20e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3974 1694c9e [Burak Yavuz] almost finished addressing comments f9d664b [Burak Yavuz] updated API and modified partitioning scheme eebbdf7 [Burak Yavuz] preliminary changes addressing code review 1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required 1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist 239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust 9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner 49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master 645afbe [Burak Yavuz] [SPARK-3974] Pull latest master b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes 19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol 589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready b693209 [Burak Yavuz] Ready for Pull request	2015-01-28 10:06:37 -08:00
Reynold Xin	119f45d61d	[SPARK-5097][SQL] DataFrame This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities. TODOs: With the exception of Python support, other tasks can be done in separate, follow-up PRs. - [ ] Audit of the API - [ ] Documentation - [ ] More test cases to cover the new API - [x] Python support - [ ] Type alias SchemaRDD Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4173 from rxin/df1 and squashes the following commits: 0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1 23b4427 [Reynold Xin] Mima. 828f70d [Reynold Xin] Merge pull request #7 from davies/df 257b9e6 [Davies Liu] add repartition 6bf2b73 [Davies Liu] fix collect with UDT and tests e971078 [Reynold Xin] Missing quotes. b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now. a728bf2 [Reynold Xin] Example rename. e8aa3d3 [Reynold Xin] groupby -> groupBy. 9662c9e [Davies Liu] improve DataFrame Python API 4ae51ea [Davies Liu] python API for dataframe 1e5e454 [Reynold Xin] Fixed a bug with symbol conversion. 2ca74db [Reynold Xin] Couple minor fixes. ea98ea1 [Reynold Xin] Documentation & literal expressions. 2b22684 [Reynold Xin] Got rid of IntelliJ problems. 02bbfbc [Reynold Xin] Tightening imports. ffbce66 [Reynold Xin] Fixed compilation error. 59b6d8b [Reynold Xin] Style violation. b85edfb [Reynold Xin] ALS. 8c37f0a [Reynold Xin] Made MLlib and examples compile 6d53134 [Reynold Xin] Hive module. d35efd5 [Reynold Xin] Fixed compilation error. ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite. 66d5ef1 [Reynold Xin] SQLContext minor patch. c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!	2015-01-27 16:08:24 -08:00
Burak Yavuz	914267484a	[SPARK-5321] Support for transposing local matrices Support for transposing local matrices added. The `.transpose` function creates a new object re-using the backing array(s) but switches `numRows` and `numCols`. Operations check the flag `.isTransposed` to see whether the indexing in `values` should be modified. This PR will pave the way for transposing `BlockMatrix`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4109 from brkyvz/SPARK-5321 and squashes the following commits: 87ab83c [Burak Yavuz] fixed scalastyle caf4438 [Burak Yavuz] addressed code review v3 c524770 [Burak Yavuz] address code review comments 2 77481e8 [Burak Yavuz] fixed MiMa f1c1742 [Burak Yavuz] small refactoring ccccdec [Burak Yavuz] fixed failed test dd45c88 [Burak Yavuz] addressed code review a01bd5f [Burak Yavuz] [SPARK-5321] Fixed MiMa issues 2a63593 [Burak Yavuz] [SPARK-5321] fixed bug causing failed gemm test c55f29a [Burak Yavuz] [SPARK-5321] Support for transposing local matrices cleaned up c408c05 [Burak Yavuz] [SPARK-5321] Support for transposing local matrices added	2015-01-27 01:46:17 -08:00
Liang-Chi Hsieh	7b0ed79795	[SPARK-5419][Mllib] Fix the logic in Vectors.sqdist The current implementation in Vectors.sqdist is not efficient because of allocating temp arrays. There is also a bug in the code `v1.indices.length / v1.size < 0.5`. This pr fixes the bug and refactors sqdist without allocating new arrays. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4217 from viirya/fix_sqdist and squashes the following commits: e8b0b3d [Liang-Chi Hsieh] For review comments. 314c424 [Liang-Chi Hsieh] Fix sqdist bug.	2015-01-27 01:29:14 -08:00
MechCoder	d6894b1c53	[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests I've added support for sampling_rate not equal to 1.0 . I have two major questions. 1. A Scala style test is failing, since the number of parameters now exceed 10. 2. I would like suggestions to understand how to test this. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4073 from MechCoder/spark-3726 and squashes the following commits: 8012fb2 [MechCoder] Add test in Strategy e0e0d9c [MechCoder] TST: Add better test d1df1b2 [MechCoder] Add test to verify subsampling behavior a7bfc70 [MechCoder] [SPARK-3726] Allow sampling_rate not equal to 1.0	2015-01-26 19:46:17 -08:00
lewuathe	f2ba5c6fc3	[SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train... ... decision tree model Labels loaded from libsvm files are mapped to 0.0 if they are negative labels because they should be nonnegative value. Author: lewuathe <lewuathe@me.com> Closes #3975 from Lewuathe/map-negative-label-to-positive and squashes the following commits: 12d1d59 [lewuathe] [SPARK-5119] Fix code styles 6d9a18a [lewuathe] [SPARK-5119] Organize test codes 62a150c [lewuathe] [SPARK-5119] Modify Impurities throw exceptions with negatie labels 3336c21 [lewuathe] [SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model	2015-01-26 18:03:21 -08:00
Yuhao Yang	81251682ed	[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384 Currently `Vectors.sqdist` return inconsistent result for sparse/dense vectors when the vectors have different lengths, please refer to JIRA for sample PR scope: Unify the sqdist logic for dense/sparse vectors and fix the inconsistency, also remove the possible sparse to dense conversion in the original code. For reviewers: Maybe we should first discuss what's the correct behavior. 1. Vectors for sqdist must have the same length, like in breeze? 2. If they can have different lengths, what's the correct result for sqdist? (should the extra part get into calculation?) I'll update PR with more optimization and additional ut afterwards. Thanks. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #4183 from hhbyyh/fixDouble and squashes the following commits: 1f17328 [Yuhao Yang] limit PR scope to size constraints only 54cbf97 [Yuhao Yang] fix Vectors.sqdist inconsistence	2015-01-25 22:18:09 -08:00
Xiangrui Meng	ea74365b7c	[SPARK-3541][MLLIB] New ALS implementation with improved storage This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation 1. uses the same algorithm, 2. uses float type for ratings, 3. uses primitive arrays to avoid GC, 4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance. The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html): ![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png) I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR. TODO: - [X] Add unit tests for implicit preferences. Author: Xiangrui Meng <meng@databricks.com> Closes #3720 from mengxr/SPARK-3541 and squashes the following commits: 1b9e852 [Xiangrui Meng] fix compile 5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541 dd0d0e8 [Xiangrui Meng] simplify test code c627de3 [Xiangrui Meng] add tests for implicit feedback b84f41c [Xiangrui Meng] address comments a76da7b [Xiangrui Meng] update ALS tests 2a8deb3 [Xiangrui Meng] add some ALS tests 857e876 [Xiangrui Meng] add tests for rating block and encoded block d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments 213d163 [Xiangrui Meng] org imports 771baf3 [Xiangrui Meng] chol doc update ca9ad9d [Xiangrui Meng] add unit tests for chol b4fd17c [Xiangrui Meng] add unit tests for NormalEquation d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder 80b8e61 [Xiangrui Meng] fix imports 4937fd4 [Xiangrui Meng] update ALS example 56c253c [Xiangrui Meng] rename product to item bce8692 [Xiangrui Meng] doc for parameters and project the output columns 3f2d81a [Xiangrui Meng] add doc 1efaecf [Xiangrui Meng] add example code 8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation	2015-01-22 22:09:13 -08:00
Liang-Chi Hsieh	246111d179	[SPARK-5365][MLlib] Refactor KMeans to reduce redundant data If a point is selected as new centers for many runs, it would collect many redundant data. This pr refactors it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4159 from viirya/small_refactor_kmeans and squashes the following commits: 25487e6 [Liang-Chi Hsieh] Refactor codes to reduce redundant data.	2015-01-22 08:16:35 -08:00
Basin	fcb3e1862f	[SPARK-5317]Set BoostingStrategy.defaultParams With Enumeration Algo.Classification or Algo.Regression JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5317 When setting the BoostingStrategy.defaultParams("Classification"), It's more straightforward to set it with the Enumeration Algo.Classification, just like BoostingStragety.defaultParams(Algo.Classification). I overload the method BoostingStragety.defaultParams(). Author: Basin <jpsachilles@gmail.com> Closes #4103 from Peishen-Jia/stragetyAlgo and squashes the following commits: 87bab1c [Basin] Docs and Code documentations updated. 3b72875 [Basin] defaultParams(algoStr: String) call defaultParams(algo: Algo). 7c1e6ee [Basin] Doc of Java updated. algo -> algoStr instead. d5c8a2e [Basin] Merge branch 'stragetyAlgo' of github.com:Peishen-Jia/spark into stragetyAlgo 65f96ce [Basin] mllib-ensembles doc modified. e04a5aa [Basin] boostingstrategy.defaultParam string algo to enumeration. 68cf544 [Basin] mllib-ensembles doc modified. a4aea51 [Basin] boostingstrategy.defaultParam string algo to enumeration.	2015-01-21 23:06:34 -08:00
Xiangrui Meng	ca7910d6dd	[SPARK-3424][MLLIB] cache point distances during k-means\|\| init This PR ports the following feature implemented in #2634 by derrickburns: * During k-means\|\| initialization, we should cache costs (squared distances) previously computed. It also contains the following optimization: * aggregate sumCosts directly * ran multiple (#runs) k-means++ in parallel I compared the performance locally on mnist-digit. Before this patch: ![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png) with this patch: ![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png) It is clear that each k-means\|\| iteration takes about the same amount of time with this patch. Authors: Derrick Burns <derrickburns@gmail.com> Xiangrui Meng <meng@databricks.com> Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits: `0a875ec` [Xiangrui Meng] address comments 4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means\|\|	2015-01-21 21:21:07 -08:00
nate.crosswhite	7450a992b3	[SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark Author: nate.crosswhite <nate.crosswhite@stresearch.com> Author: nxwhite-str <nxwhite-str@users.noreply.github.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3610 from nxwhite-str/master and squashes the following commits: a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed 7668124 [Xiangrui Meng] minor updates f8d5928 [nate.crosswhite] Addressing PR issues 277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test 616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API	2015-01-21 10:32:10 -08:00
Reza Zadeh	aa1e22b17b	[MLlib] [SPARK-5301] Missing conversions and operations on IndexedRowMatrix and CoordinateMatrix * Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) * IndexedRowMatrix should be convertable to CoordinateMatrix (conversion added) Tests for both added. Author: Reza Zadeh <reza@databricks.com> Closes #4089 from rezazadeh/matutils and squashes the following commits: ec5238b [Reza Zadeh] Array -> Iterator to avoid temp array 3ce0b5d [Reza Zadeh] Array -> Iterator bbc907a [Reza Zadeh] Use 'i' for index, and zipWithIndex cb10ae5 [Reza Zadeh] remove unnecessary import a7ae048 [Reza Zadeh] Missing linear algebra utilities	2015-01-21 09:48:38 -08:00
Yuhao Yang	2f82c841fa	[SPARK-5186] [MLLIB] Vector.equals and Vector.hashCode are very inefficient JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5186 Currently SparseVector is using the inherited equals from Vector, which will create a full-size array for even the sparse vector. The pull request contains a specialized equals optimization that improves on both time and space. 1. The implementation will be consistent with the original. Especially it will keep equality comparison between SparseVector and DenseVector. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Yuhao Yang <yuhao@yuhaodevbox.sh.intel.com> Closes #3997 from hhbyyh/master and squashes the following commits: 0d9d130 [Yuhao Yang] function name change and ut update 93f0d46 [Yuhao Yang] unify sparse vs dense vectors 985e160 [Yuhao Yang] improve locality for equals bdf8789 [Yuhao Yang] improve equals and rewrite hashCode for Vector a6952c3 [Yuhao Yang] fix scala style for comments 50abef3 [Yuhao Yang] fix ut for sparse vector with explicit 0 f41b135 [Yuhao Yang] iterative equals for sparse vector 5741144 [Yuhao Yang] Specialized equals for SparseVector	2015-01-20 15:20:20 -08:00
Travis Galoppo	23e25543be	SPARK-5019 [MLlib] - GaussianMixtureModel exposes instances of MultivariateGauss... This PR modifies GaussianMixtureModel to expose instances of MutlivariateGaussian rather than separate mean and covariance arrays. Author: Travis Galoppo <tjg2107@columbia.edu> Closes #4088 from tgaloppo/spark-5019 and squashes the following commits: 3ef6c7f [Travis Galoppo] In GaussianMixtureModel: Changed name of weight, gaussian to weights, gaussians. Other sources modified accordingly. 091e8da [Travis Galoppo] SPARK-5019 - GaussianMixtureModel exposes instances of MultivariateGaussian rather than mean/covariance matrices	2015-01-20 12:58:11 -08:00
Yuhao Yang	4432568aac	[SPARK-5282][mllib]: RowMatrix easily gets int overflow in the memory size warning JIRA: https://issues.apache.org/jira/browse/SPARK-5282 fix the possible int overflow in the memory computation warning Author: Yuhao Yang <hhbyyh@gmail.com> Closes #4069 from hhbyyh/addscStop and squashes the following commits: e54e5c8 [Yuhao Yang] change to MB based number 7afac23 [Yuhao Yang] 5282: fix int overflow in the warning	2015-01-19 10:10:15 -08:00
Reynold Xin	61b427d4b1	[SPARK-5193][SQL] Remove Spark SQL Java-specific API. After the following patches, the main (Scala) API is now usable for Java users directly. https://github.com/apache/spark/pull/4056 https://github.com/apache/spark/pull/4054 https://github.com/apache/spark/pull/4049 https://github.com/apache/spark/pull/4030 https://github.com/apache/spark/pull/3965 https://github.com/apache/spark/pull/3958 Author: Reynold Xin <rxin@databricks.com> Closes #4065 from rxin/sql-java-api and squashes the following commits: b1fd860 [Reynold Xin] Fix Mima 6d86578 [Reynold Xin] Ok one more attempt in fixing Python... e8f1455 [Reynold Xin] Fix Python again... 3e53f91 [Reynold Xin] Fixed Python. 83735da [Reynold Xin] Fix BigDecimal test. e9f1de3 [Reynold Xin] Use scala BigDecimal. 500d2c4 [Reynold Xin] Fix Decimal. ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory. c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.	2015-01-16 21:09:06 -08:00
Reynold Xin	f9969098c8	[SPARK-5123][SQL] Reconcile Java/Scala API for data types. Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box. As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code. This subsumes https://github.com/apache/spark/pull/3925 Author: Reynold Xin <rxin@databricks.com> Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits: 66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).	2015-01-13 17:16:41 -08:00
Travis Galoppo	2130de9d8f	SPARK-5018 [MLlib] [WIP] Make MultivariateGaussian public Moving MutlivariateGaussian from private[mllib] to public. The class uses Breeze vectors internally, so this involves creating a public interface using MLlib vectors and matrices. This initial commit provides public construction, accessors for mean/covariance, density and log-density. Other potential methods include entropy and sample generation. Author: Travis Galoppo <tjg2107@columbia.edu> Closes #3923 from tgaloppo/spark-5018 and squashes the following commits: 2b15587 [Travis Galoppo] Style correction b4121b4 [Travis Galoppo] Merge remote-tracking branch 'upstream/master' into spark-5018 e30a100 [Travis Galoppo] Made mu, sigma private[mllib] members of MultivariateGaussian Moved MultivariateGaussian (and test suite) from stat.impl to stat.distribution (required updates in GaussianMixture{EM,Model}.scala) Marked MultivariateGaussian as @DeveloperApi Fixed style error 9fa3bb7 [Travis Galoppo] Style improvements 91a5fae [Travis Galoppo] Rearranged equation for part of density function 8c35381 [Travis Galoppo] Fixed accessor methods to match member variable names. Modified calculations to avoid log(pow(x,y)) calculations 0943dc4 [Travis Galoppo] SPARK-5018 4dee9e1 [Travis Galoppo] SPARK-5018	2015-01-11 21:31:16 -08:00
MechCoder	4554529dce	[SPARK-4406] [MLib] FIX: Validate k in SVD Raise exception when k is non-positive in SVD Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #3945 from MechCoder/spark-4406 and squashes the following commits: 64e6d2d [MechCoder] TST: Add better test errors and messages 12dae73 [MechCoder] [SPARK-4406] FIX: Validate k in SVD	2015-01-09 17:45:18 -08:00
Joseph K. Bradley	7e8e62aec1	[SPARK-5015] [mllib] Random seed for GMM + make test suite deterministic Issues: * From JIRA: GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. * This also makes the test suite flaky since initialization can fail due to stochasticity. Fix: * Add random seed * Use it in test suite CC: mengxr tgaloppo Author: Joseph K. Bradley <joseph@databricks.com> Closes #3981 from jkbradley/gmm-seed and squashes the following commits: f0df4fd [Joseph K. Bradley] Added seed parameter to GMM. Updated test suite to use seed to prevent flakiness	2015-01-09 13:00:15 -08:00
Liang-Chi Hsieh	e9ca16ec94	[SPARK-5145][Mllib] Add BLAS.dsyr and use it in GaussianMixtureEM This pr uses BLAS.dsyr to replace few implementations in GaussianMixtureEM. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3949 from viirya/blas_dsyr and squashes the following commits: 4e4d6cf [Liang-Chi Hsieh] Add unit test. Rename function name, modify doc and style. 3f57fd2 [Liang-Chi Hsieh] Add BLAS.dsyr and use it in GaussianMixtureEM.	2015-01-09 10:27:33 -08:00
Marcelo Vanzin	48cecf673c	[SPARK-4048] Enhance and extend hadoop-provided profile. This change does a few things to make the hadoop-provided profile more useful: - Create new profiles for other libraries / services that might be provided by the infrastructure - Simplify and fix the poms so that the profiles are only activated while building assemblies. - Fix tests so that they're able to run when the profiles are activated - Add a new env variable to be used by distributions that use these profiles to provide the runtime classpath for Spark jobs and daemons. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2982 from vanzin/SPARK-4048 and squashes the following commits: 82eb688 [Marcelo Vanzin] Add a comment. eb228c0 [Marcelo Vanzin] Fix borked merge. 4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes. 371ebee [Marcelo Vanzin] Review feedback. 52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 322f882 [Marcelo Vanzin] Fix merge fail. f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 9640503 [Marcelo Vanzin] Cleanup child process log message. 115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom). e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile. 7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles. 1be73d4 [Marcelo Vanzin] Restore flume-provided profile. d1399ed [Marcelo Vanzin] Restore jetty dependency. 82a54b9 [Marcelo Vanzin] Remove unused profile. 5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles. 1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver. f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list. 9e4e001 [Marcelo Vanzin] Remove duplicate hive profile. d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log. 4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn. 417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH". 2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing. 1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects. 284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.	2015-01-08 17:15:13 -08:00
RJ Nowling	c9c8b219ad	[SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to P... ...ySpark MLlib This is a follow up to PR3680 https://github.com/apache/spark/pull/3680 . Author: RJ Nowling <rnowling@gmail.com> Closes #3955 from rnowling/spark4891 and squashes the following commits: 1236a01 [RJ Nowling] Fix Python style issues 7a01a78 [RJ Nowling] Fix Python style issues 174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib	2015-01-08 15:03:43 -08:00
Fernando Otero (ZeoS)	72df5a301e	SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS configurable Author: Fernando Otero (ZeoS) <fotero@gmail.com> Closes #3953 from zeitos/storageLevel and squashes the following commits: 0f070b9 [Fernando Otero (ZeoS)] fix imports 6869e80 [Fernando Otero (ZeoS)] fix comment length 90c9f7e [Fernando Otero (ZeoS)] fix comment length 18a992e [Fernando Otero (ZeoS)] changing storage level	2015-01-08 12:42:54 -08:00
Shuo Xiang	c66a976300	[SPARK-5116][MLlib] Add extractor for SparseVector and DenseVector Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we may use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } Author: Shuo Xiang <shuoxiangpub@gmail.com> Closes #3919 from coderxiang/extractor and squashes the following commits: 359e8d5 [Shuo Xiang] merge master ca5fc3e [Shuo Xiang] merge master 0b1e190 [Shuo Xiang] use extractor for vectors in RowMatrix.scala e961805 [Shuo Xiang] use extractor for vectors in StandardScaler.scala c2bbdaf [Shuo Xiang] use extractor for vectors in IDFscala 8433922 [Shuo Xiang] use extractor for vectors in NaiveBayes.scala and Normalizer.scala d83c7ca [Shuo Xiang] use extractor for vectors in Vectors.scala 5523dad [Shuo Xiang] Add extractor for SparseVector and DenseVector	2015-01-07 23:22:37 -08:00
DB Tsai	60e2d9e290	[SPARK-5128][MLLib] Add common used log1pExp API in MLUtils When `x` is positive and large, computing `math.log(1 + math.exp(x))` will lead to arithmetic overflow. This will happen when `x > 709.78` which is not a very large number. It can be addressed by rewriting the formula into `x + math.log1p(math.exp(-x))` when `x > 0`. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3915 from dbtsai/mathutil and squashes the following commits: bec6a84 [DB Tsai] remove empty line 3239541 [DB Tsai] revert part of patch into another PR 23144f3 [DB Tsai] doc 49f3658 [DB Tsai] temp 6c29ed3 [DB Tsai] formating f8447f9 [DB Tsai] address another overflow issue in gradientMultiplier in LOR gradient code 64eefd0 [DB Tsai] first commit	2015-01-07 10:13:41 -08:00
Liang-Chi Hsieh	e21acc1978	[SPARK-5099][Mllib] Simplify logistic loss function This is a minor pr where I think that we can simply take minus of `margin`, instead of subtracting `margin`. Mathematically, they are equal. But the modified equation is the common form of logistic loss function and so more readable. It also computes more accurate value as some quick tests show. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3899 from viirya/logit_func and squashes the following commits: 91a3860 [Liang-Chi Hsieh] Modified for comment. 0aa51e4 [Liang-Chi Hsieh] Further simplified. 72a295e [Liang-Chi Hsieh] Revert LogLoss back and add more considerations in Logistic Loss. a3f83ca [Liang-Chi Hsieh] Fix a bug. 2bc5712 [Liang-Chi Hsieh] Simplify loss function.	2015-01-06 21:23:31 -08:00
Liang-Chi Hsieh	bb38ebb1ab	[SPARK-5050][Mllib] Add unit test for sqdist Related to #3643. Follow the previous suggestion to add unit test for `sqdist` in `VectorsSuite`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3869 from viirya/sqdist_test and squashes the following commits: fb743da [Liang-Chi Hsieh] Modified for comment and fix bug. 90a08f3 [Liang-Chi Hsieh] Modified for comment. 39a3ca6 [Liang-Chi Hsieh] Take care of special case. b789f42 [Liang-Chi Hsieh] More proper unit test with random sparsity pattern. c36be68 [Liang-Chi Hsieh] Add unit test for sqdist.	2015-01-06 14:00:45 -08:00
Travis Galoppo	4108e5f36f	SPARK-5017 [MLlib] - Use SVD to compute determinant and inverse of covariance matrix MultivariateGaussian was calling both pinv() and det() on the covariance matrix, effectively performing two matrix decompositions. Both values are now computed using the singular value decompositon. Both the pseudo-inverse and the pseudo-determinant are used to guard against singular matrices. Author: Travis Galoppo <tjg2107@columbia.edu> Closes #3871 from tgaloppo/spark-5017 and squashes the following commits: 383b5b3 [Travis Galoppo] MultivariateGaussian - minor optimization in density calculation a5b8bc5 [Travis Galoppo] Added additional points to tests in test suite. Fixed comment in MultivariateGaussian 629d9d0 [Travis Galoppo] Moved some test values from var to val. dc3d0f7 [Travis Galoppo] Catch potential exception calculating pseudo-determinant. Style improvements. d448137 [Travis Galoppo] Added test suite for MultivariateGaussian, including test for degenerate case. 1989be0 [Travis Galoppo] SPARK-5017 - Fixed to use SVD to compute determinant and inverse of covariance matrix. Previous code called both pinv() and det(), effectively performing two matrix decompositions. Additionally, the pinv() implementation in Breeze is known to fail for singular matrices. b4415ea [Travis Galoppo] Merge branch 'spark-5017' of https://github.com/tgaloppo/spark into spark-5017 6f11b6d [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix. Code was calling both det() and pinv(), effectively performing two matrix decompositions. Futhermore, Breeze pinv() currently fails for singular matrices. fd9784c [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix	2015-01-06 13:57:42 -08:00
Sean Owen	4cba6eb420	SPARK-4159 [CORE] Maven build doesn't run JUnit test suites This PR: - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar) - Tells `surefire` to test only Java tests - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication. For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run. Author: Sean Owen <sowen@cloudera.com> Closes #3651 from srowen/SPARK-4159 and squashes the following commits: 2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete 12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit. e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent	2015-01-06 12:02:08 -08:00
Travis Galoppo	c4f0b4f334	SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an RDD only Removed unnecessary parameters to predictMembership() CC: jkbradley Author: Travis Galoppo <tjg2107@columbia.edu> Closes #3854 from tgaloppo/spark-5020 and squashes the following commits: 1bf4669 [Travis Galoppo] renamed predictMembership() to predictSoft() 0f1d96e [Travis Galoppo] SPARK-5020 - Removed superfluous parameters from predictMembership()	2014-12-31 15:39:58 -08:00
Sean Owen	3d194cc757	SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics Now that I've implemented the basics here, I'm less convinced there is a need for this change, somehow. Callers can downsample before or after. Really the OOM is not in the ROC curve code, but in code that might `collect()` it for local analysis. Still, might be useful to down-sample since the ROC curve probably never needs millions of points. This is a first pass. Since the `(score,label)` are already grouped and sorted, I think it's sufficient to just take every Nth such pair, in order to downsample by a factor of N? this is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve of course. What do you think about the API, and usefulness? Author: Sean Owen <sowen@cloudera.com> Closes #3702 from srowen/SPARK-4547 and squashes the following commits: 1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc 692d825 [Sean Owen] Change handling of large numBins, make 2nd consturctor instead of optional param, style change a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics	2014-12-31 13:37:04 -08:00
Liang-Chi Hsieh	06a9aa589c	[SPARK-4797] Replace breezeSquaredDistance This PR replaces slow breezeSquaredDistance. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3643 from viirya/faster_squareddistance and squashes the following commits: f28b275 [Liang-Chi Hsieh] Move the implementation to linalg.Vectors and rename as sqdist. 0bc48ee [Liang-Chi Hsieh] Merge branch 'master' into faster_squareddistance ba34422 [Liang-Chi Hsieh] Fix bug. 91849d0 [Liang-Chi Hsieh] Modified for comment. 44a65ad [Liang-Chi Hsieh] Modified for comments. 35db395 [Liang-Chi Hsieh] Fix bug and some modifications for comments. f4f5ebb [Liang-Chi Hsieh] Follow BLAS.dot pattern to replace intersect, diff with while-loop. a36e09f [Liang-Chi Hsieh] Use while-loop to replace foreach for better performance. d3e0628 [Liang-Chi Hsieh] Make the methods private. dd415bc [Liang-Chi Hsieh] Consider different cases of SparseVector and DenseVector. 13669db [Liang-Chi Hsieh] Replace breezeSquaredDistance.	2014-12-31 11:50:53 -08:00
Liu Jiongzhou	035bac88c7	[SPARK-4998][MLlib]delete the "train" function To make the functions with the same in "object" effective, specially when using java reflection. As the "train" function defined in "class DecisionTree" will hide the functions with the same name in "object DecisionTree". JIRA[SPARK-4998] Author: Liu Jiongzhou <ljzzju@163.com> Closes #3836 from ljzzju/master and squashes the following commits: 4e13133 [Liu Jiongzhou] [MLlib]delete the "train" function	2014-12-30 15:55:56 -08:00
Jakub Dubovsky	0f31992c61	[Spark-4995] Replace Vector.toBreeze.activeIterator with foreachActive New foreachActive method of vector was introduced by SPARK-4431 as more efficient alternative to vector.toBreeze.activeIterator. There are some parts of codebase where it was not yet replaced. dbtsai Author: Jakub Dubovsky <dubovsky@avast.com> Closes #3846 from james64/SPARK-4995-foreachActive and squashes the following commits: 3eb7e37 [Jakub Dubovsky] Scalastyle fix 32fe6c6 [Jakub Dubovsky] activeIterator removed - IndexedRowMatrix.toBreeze 47a4777 [Jakub Dubovsky] activeIterator removed in RowMatrix.toBreeze 90a7d98 [Jakub Dubovsky] activeIterator removed in MLUtils.saveAsLibSVMFile	2014-12-30 14:19:07 -08:00
DB Tsai	040d6f2d13	[SPARK-4972][MLlib] Updated the scala doc for lasso and ridge regression for the change of LeastSquaresGradient In #SPARK-4907, we added factor of 2 into the LeastSquaresGradient. We updated the scala doc for lasso and ridge regression here. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3808 from dbtsai/doc and squashes the following commits: ec3c989 [DB Tsai] first commit	2014-12-29 17:17:12 -08:00
ganonp	343db392b5	Added setMinCount to Word2Vec.scala Wanted to customize the private minCount variable in the Word2Vec class. Added a method to do so. Author: ganonp <ganonp@gmail.com> Closes #3693 from ganonp/my-custom-spark and squashes the following commits: ad534f2 [ganonp] made norm method public 5110a6f [ganonp] Reorganized 854958b [ganonp] Fixed Indentation for setMinCount 12ed8f9 [ganonp] Update Word2Vec.scala 76bdf5a [ganonp] Update Word2Vec.scala ffb88bb [ganonp] Update Word2Vec.scala 5eb9100 [ganonp] Added setMinCount to Word2Vec.scala	2014-12-29 15:31:19 -08:00
Travis Galoppo	6cf6fdf3ff	SPARK-4156 [MLLIB] EM algorithm for GMMs Implementation of Expectation-Maximization for Gaussian Mixture Models. This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license. Author: Travis Galoppo <tjg2107@columbia.edu> Author: Travis Galoppo <travis@localhost.localdomain> Author: tgaloppo <tjg2107@columbia.edu> Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #3022 from tgaloppo/master and squashes the following commits: aaa8f25 [Travis Galoppo] MLUtils: changed privacy of EPSILON from [util] to [mllib] 709e4bf [Travis Galoppo] fixed usage line to include optional maxIterations parameter acf1fba [Travis Galoppo] Fixed parameter comment in GaussianMixtureModel Made maximum iterations an optional parameter to DenseGmmEM 9b2fc2a [Travis Galoppo] Style improvements Changed ExpectationSum to a private class b97fe00 [Travis Galoppo] Minor fixes and tweaks. 1de73f3 [Travis Galoppo] Removed redundant array from array creation 578c2d1 [Travis Galoppo] Removed unused import 227ad66 [Travis Galoppo] Moved prediction methods into model class. 308c8ad [Travis Galoppo] Numerous changes to improve code cff73e0 [Travis Galoppo] Replaced accumulators with RDD.aggregate 20ebca1 [Travis Galoppo] Removed unusued code 42b2142 [Travis Galoppo] Added functionality to allow setting of GMM starting point. Added two cluster test to testing suite. 8b633f3 [Travis Galoppo] Style issue 9be2534 [Travis Galoppo] Style issue d695034 [Travis Galoppo] Fixed style issues c3b8ce0 [Travis Galoppo] Merge branch 'master' of https://github.com/tgaloppo/spark Adds predict() method 2df336b [Travis Galoppo] Fixed style issue b99ecc4 [tgaloppo] Merge pull request #1 from FlytxtRnD/predictBranch f407b4c [FlytxtRnD] Added predict() to return the cluster labels and membership values 97044cf [Travis Galoppo] Fixed style issues dc9c742 [Travis Galoppo] Moved MultivariateGaussian utility class e7d413b [Travis Galoppo] Moved multivariate Gaussian utility class to mllib/stat/impl Improved comments 9770261 [Travis Galoppo] Corrected a variety of style and naming issues. 8aaa17d [Travis Galoppo] Added additional train() method to companion object for cluster count and tolerance parameters. 676e523 [Travis Galoppo] Fixed to no longer ignore delta value provided on command line e6ea805 [Travis Galoppo] Merged with master branch; update test suite with latest context changes. Improved cluster initialization strategy. 86fb382 [Travis Galoppo] Merge remote-tracking branch 'upstream/master' 719d8cc [Travis Galoppo] Added scala test suite with basic test c1a8e16 [Travis Galoppo] Made GaussianMixtureModel class serializable Modified sum function for better performance 5c96c57 [Travis Galoppo] Merge remote-tracking branch 'upstream/master' c15405c [Travis Galoppo] SPARK-4156	2014-12-29 15:29:15 -08:00
Burak Yavuz	02b55de3dc	[SPARK-4409][MLlib] Additional Linear Algebra Utils Addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - zeros: Generate a matrix consisting of zeros - ones: Generate a matrix consisting of ones - eye: Generate an identity matrix - rand: Generate a matrix consisting of i.i.d. uniform random numbers - randn: Generate a matrix consisting of i.i.d. gaussian random numbers - diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. The names for these methods were selected from MATLAB Author: Burak Yavuz <brkyvz@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3319 from brkyvz/SPARK-4409 and squashes the following commits: b0354f6 [Burak Yavuz] [SPARK-4409] Incorporated mengxr's code 04c4829 [Burak Yavuz] Merge pull request #1 from mengxr/SPARK-4409 80cfa29 [Xiangrui Meng] minor changes ecc937a [Xiangrui Meng] update sprand 4e95e24 [Xiangrui Meng] simplify fromCOO implementation 10a63a6 [Burak Yavuz] [SPARK-4409] Fourth pass of code review f62d6c7 [Burak Yavuz] [SPARK-4409] Modified genRandMatrix 3971c93 [Burak Yavuz] [SPARK-4409] Third pass of code review 75239f8 [Burak Yavuz] [SPARK-4409] Second pass of code review e4bd0c0 [Burak Yavuz] [SPARK-4409] Modified horzcat and vertcat 65c562e [Burak Yavuz] [SPARK-4409] Hopefully fixed Java Test d8be7bc [Burak Yavuz] [SPARK-4409] Organized imports 065b531 [Burak Yavuz] [SPARK-4409] First pass after code review a8120d2 [Burak Yavuz] [SPARK-4409] Finished updates to API according to SPARK-4614 f798c82 [Burak Yavuz] [SPARK-4409] Updated API according to SPARK-4614 c75f3cd [Burak Yavuz] [SPARK-4409] Added JavaAPI Tests, and fixed a couple of bugs d662f9d [Burak Yavuz] [SPARK-4409] Modified according to remote repo 83dfe37 [Burak Yavuz] [SPARK-4409] Scalastyle error fixed a14c0da [Burak Yavuz] [SPARK-4409] Initial commit to add methods	2014-12-29 13:24:26 -08:00
zsxwing	f9ed2b6641	[SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR did similar reorganization like [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397). Compiled the following codes with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine. ```Scala import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object StreamingApp { def main(args: Array[String]) { val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount") val ssc = new StreamingContext(conf, Seconds(10)) val lines = ssc.textFileStream("/some/path") val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } } ``` Author: zsxwing <zsxwing@gmail.com> Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits: aa6d44a [zsxwing] Fix a copy-paste error f74c190 [zsxwing] Merge branch 'master' into SPARK-4608 e6f9cc9 [zsxwing] Update the docs 27833bb [zsxwing] Remove `import StreamingContext._` c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience	2014-12-25 19:46:05 -08:00
Sean Owen	29fabb1b52	SPARK-4297 [BUILD] Build warning fixes omnibus There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now. Author: Sean Owen <sowen@cloudera.com> Closes #3157 from srowen/SPARK-4297 and squashes the following commits: 8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes	2014-12-24 13:32:51 -08:00
DB Tsai	a96b72781a	[SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R In most of the academic paper and algorithm implementations, people use L = 1/2n \|\|A weights-y\|\|^2 instead of L = 1/n \|\|A weights-y\|\|^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3746 from dbtsai/lir and squashes the following commits: 19c2e85 [DB Tsai] make stepsize twice to converge to the same solution 0b2c29c [DB Tsai] first commit	2014-12-22 16:42:55 -08:00
RJ Nowling	ee1fb97a97	[SPARK-4728][MLLib] Add exponential, gamma, and log normal sampling to MLlib da... ...ta generators This patch adds: * Exponential, gamma, and log normal generators that wrap Apache Commons math3 to the private API * Functions for generating exponential, gamma, and log normal RDDs and vector RDDs * Tests for the above Author: RJ Nowling <rnowling@gmail.com> Closes #3680 from rnowling/spark4728 and squashes the following commits: 455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite 3e1134a [RJ Nowling] Fix val/var, unncessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits 58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev 84fd98d [RJ Nowling] Add more values for testing distributions. 9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators	2014-12-18 21:00:49 -08:00
DB Tsai	59a49db598	[SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite The original test doesn't make sense since if you step in, the lossSum is already NaN, and the coefficients are diverging. That's because the step size is too large for SGD, so it doesn't work. The correct behavior is that you should get smaller coefficients than the one without regularization. Comparing the values using 20000.0 relative error doesn't make sense as well. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3735 from dbtsai/mlortestfix and squashes the following commits: b1a3c42 [DB Tsai] first commit	2014-12-18 13:55:49 -08:00
Yuu ISHIKAWA	8098fab06c	[SPARK-4494][mllib] IDFModel.transform() add support for single vector I improved `IDFModel.transform` to allow using a single vector. [[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494) Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #3603 from yu-iskw/idf and squashes the following commits: 256ff3d [Yuu ISHIKAWA] Fix typo a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector	2014-12-15 13:44:15 -08:00
Xiangrui Meng	7e758d7092	[FIX][DOC] Fix broken links in ml-guide.md and some minor changes in ScalaDoc. Author: Xiangrui Meng <meng@databricks.com> Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits: c559768 [Xiangrui Meng] minor code update ce94da8 [Xiangrui Meng] Java Bean -> JavaBean 0b5c182 [Xiangrui Meng] fix links in ml-guide	2014-12-04 20:16:35 +08:00
Joseph K. Bradley	469a6e5f3b	[SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes Documentation: * Added ml-guide.md, linked from mllib-guide.md * Updated mllib-guide.md with small section pointing to ml-guide.md Examples: * CrossValidatorExample * SimpleParamsExample * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md) Bug fixes: * PipelineModel: did not use ParamMaps correctly * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!) CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete. Author: Joseph K. Bradley <joseph@databricks.com> Author: jkbradley <joseph.kurata.bradley@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3588 from jkbradley/ml-package-docs and squashes the following commits: d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml c38469c [Joseph K. Bradley] Updated ml-guide with CV examples 99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold. ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs 3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype 41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.	2014-12-04 17:00:06 +08:00
Joseph K. Bradley	657a88835d	[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix Major changes: * Added programming guide sections for tree ensembles * Added examples for tree ensembles * Updated DecisionTree programming guide with more info on parameters * API change: Standardized the tree parameter for the number of classes (for classification) Minor changes: * Updated decision tree documentation * Updated existing tree and tree ensemble examples * Use train/test split, and compute test error instead of training error. * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix) Note: I know this is a lot of lines, but most is covered by: * Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.) * New examples (which were copied from the programming guide) * The "numClasses" renaming I have run all examples and relevant unit tests. CC: mengxr manishamde codedeft Author: Joseph K. Bradley <joseph@databricks.com> Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #3461 from jkbradley/ensemble-docs and squashes the following commits: 70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide 8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide 6fab846 [Joseph K. Bradley] small fixes based on review b9f8576 [Joseph K. Bradley] updated decision tree doc 375204c [Joseph K. Bradley] fixed python style 2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide. 706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small c76c823 [Joseph K. Bradley] added migration guide for mllib abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder 07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification). cdfdfbc [Joseph K. Bradley] added examples for GBT 6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them. ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples	2014-12-04 09:57:50 +08:00
DB Tsai	d00542987e	[SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple times in loop Have a local reference to `values` and `indices` array in the `Vector` object so JVM can locate the value with one operation call. See `SPARK-4581` for similar optimization, and the bytecode analysis. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3577 from dbtsai/blasopt and squashes the following commits: 62d38c4 [DB Tsai] formating 0316cef [DB Tsai] first commit	2014-12-03 22:31:39 +08:00
DB Tsai	7fc49ed911	[SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation. Here is the benchmark against mnist8m dataset. Before DenseVector: 70.04secs SparseVector: 59.05secs With this PR DenseVector: 30.58secs SparseVector: 21.14secs Author: DB Tsai <dbtsai@alpinenow.com> Closes #3565 from dbtsai/kmean and squashes the following commits: 08bc068 [DB Tsai] restyle de24662 [DB Tsai] address feedback b185a77 [DB Tsai] cleanup 4554ddd [DB Tsai] first commit	2014-12-03 19:01:56 +08:00
DB Tsai	64f3175bf9	[SPARK-4611][MLlib] Implement the efficient vector norm The vector norm in breeze is implemented by `activeIterator` which is known to be very slow. In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and `k-means` have big performance improvement. Here is the benchmark against mnist8m dataset. a) `Normalizer` Before DenseVector: 68.25secs SparseVector: 17.01secs With this PR DenseVector: 12.71secs SparseVector: 2.73secs b) `k-means` Before DenseVector: 83.46secs SparseVector: 61.60secs With this PR DenseVector: 70.04secs SparseVector: 59.05secs Author: DB Tsai <dbtsai@alpinenow.com> Closes #3462 from dbtsai/norm and squashes the following commits: 63c7165 [DB Tsai] typo 0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back 6fa616c [DB Tsai] address feedback 9b7cb56 [DB Tsai] move norm to static method 0b632e6 [DB Tsai] kmeans dbed124 [DB Tsai] style c1a877c [DB Tsai] first commit	2014-12-02 11:40:43 +08:00
Xiangrui Meng	561d31d2f1	[SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to change is `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation. Otherwise, it is very likely to produce inconsistent RDDs. I also added some unit tests for matrix factory methods. All APIs are new in 1.2, so there is no incompatible changes. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #3468 from mengxr/SPARK-4614 and squashes the following commits: 3b0e4e2 [Xiangrui Meng] add mima excludes 6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests	2014-11-26 08:22:50 -08:00
Xiangrui Meng	b5fb1410c5	[SPARK-4604][MLLIB] make MatrixFactorizationModel public User could construct an MF model directly. I added a note about the performance. Author: Xiangrui Meng <meng@databricks.com> Closes #3459 from mengxr/SPARK-4604 and squashes the following commits: f64bcd3 [Xiangrui Meng] organize imports ed08214 [Xiangrui Meng] check preconditions and unit tests a624c12 [Xiangrui Meng] make MatrixFactorizationModel public	2014-11-25 20:11:40 -08:00
Joseph K. Bradley	c251fd7405	[SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates Currently, the LogLoss used by GradientBoostedTrees has 2 issues: * the gradient (and therefore loss) does not match that used by Friedman (1999) * the error computation uses 0/1 accuracy, not log loss This PR updates LogLoss. It also adds some doc for boosting and forests. I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration. CC: mengxr manishamde codedeft Author: Joseph K. Bradley <joseph@databricks.com> Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits: cfec17e [Joseph K. Bradley] removed forgotten temp comments a27eb6d [Joseph K. Bradley] corrections to last log loss commit ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability 5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError. This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once) e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting	2014-11-25 20:10:15 -08:00
DB Tsai	bf1a6aaac5	[SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance The following optimizations are done to improve the StandardScaler model transformation performance. 1) Covert Breeze dense vector to primitive vector to reduce the overhead. 2) Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector. 3) Have a local reference to `shift` and `factor` array so JVM can locate the value with one operation call. 4) In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to make the codebase cleaner. Benchmark with mnist8m dataset: Before, DenseVector withMean and withStd: 50.97secs DenseVector withMean and withoutStd: 42.11secs DenseVector withoutMean and withStd: 8.75secs SparseVector withoutMean and withStd: 5.437secs With this PR, DenseVector withMean and withStd: 5.76secs DenseVector withMean and withoutStd: 5.28secs DenseVector withoutMean and withStd: 5.30secs SparseVector withoutMean and withStd: 1.27secs Note that without the local reference copy of `factor` and `shift` arrays, the runtime is almost three time slower. DenseVector withMean and withStd: 18.15secs DenseVector withMean and withoutStd: 18.05secs DenseVector withoutMean and withStd: 18.54secs SparseVector withoutMean and withStd: 2.01secs The following code, ```scala while (i < size) { values(i) = (values(i) - shift(i)) * factor(i) i += 1 } ``` will generate the bytecode ``` L13 LINENUMBER 106 L13 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] [] ILOAD 7 ILOAD 6 IF_ICMPGE L14 L15 LINENUMBER 107 L15 ALOAD 5 ILOAD 7 ALOAD 5 ILOAD 7 DALOAD ALOAD 0 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D ILOAD 7 DALOAD DSUB ALOAD 0 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D ILOAD 7 DALOAD DMUL DASTORE L16 LINENUMBER 108 L16 ILOAD 7 ICONST_1 IADD ISTORE 7 GOTO L13 ``` , while with the local reference of the `shift` and `factor` arrays, the bytecode will be ``` L14 LINENUMBER 107 L14 ALOAD 0 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D ASTORE 9 L15 LINENUMBER 108 L15 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] [] ILOAD 8 ILOAD 7 IF_ICMPGE L16 L17 LINENUMBER 109 L17 ALOAD 6 ILOAD 8 ALOAD 6 ILOAD 8 DALOAD ALOAD 2 ILOAD 8 DALOAD DSUB ALOAD 9 ILOAD 8 DALOAD DMUL DASTORE L18 LINENUMBER 110 L18 ILOAD 8 ICONST_1 IADD ISTORE 8 GOTO L15 ``` You can see that with local reference, the both of the arrays will be in the stack, so JVM can access the value without calling `INVOKESPECIAL`. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3435 from dbtsai/standardscaler and squashes the following commits: 85885a9 [DB Tsai] revert to have lazy in shift array. daf2b06 [DB Tsai] Address the feedback cdb5cef [DB Tsai] small change 9c51eef [DB Tsai] style fc795e4 [DB Tsai] update 5bffd3d [DB Tsai] first commit	2014-11-25 11:07:11 -08:00
GuoQiang Li	f515f9432b	[SPARK-4526][MLLIB]GradientDescent get a wrong gradient value according to the gradient formula. This is caused by the miniBatchSize parameter.The number of `RDD.sample` returns is not fixed. cc mengxr Author: GuoQiang Li <witgo@qq.com> Closes #3399 from witgo/GradientDescent and squashes the following commits: 13cb228 [GuoQiang Li] review commit 668ab66 [GuoQiang Li] Double to Long b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0 0b5c3e3 [GuoQiang Li] Minor fix 12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.	2014-11-25 02:01:19 -08:00
DB Tsai	89f9122646	[SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner In this refactoring, the performance will be slightly increased due to removing the overhead from breeze vector. The bottleneck is still in breeze norm which is implemented by activeIterator. This inefficiency of breeze norm will be addressed in next PR. At least, this PR makes the code more consistent in the codebase. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3446 from dbtsai/normalizer and squashes the following commits: e20a2b9 [DB Tsai] first commit	2014-11-25 01:57:34 -08:00
tkaessmann	9ce2bf3821	[SPARK-4582][MLLIB] get raw vectors for further processing in Word2Vec This is #3309 for the master branch. e.g. clustering Author: tkaessmann <tobias.kaessmanns24.com> Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits: e3a3142 [tkaessmann] changes the comment for getVectors 58d3d83 [tkaessmann] removes sign from comment a5be213 [tkaessmann] fixes getVectors to fit code guidelines 3782fa9 [tkaessmann] get raw vectors for further processing Author: tkaessmann <tobias.kaessmann@s24.com> Closes #3437 from mengxr/SPARK-4582 and squashes the following commits: 6c666b4 [tkaessmann] get raw vectors for further processing in Word2Vec	2014-11-24 19:58:01 -08:00
Davies Liu	b660de7a9c	[SPARK-4562] [MLlib] speedup vector This PR change the underline array of DenseVector to numpy.ndarray to avoid the conversion, because most of the users will using numpy.array. It also improve the serialization of DenseVector. Before this change: trial \| trainingTime \| testTime -------\|--------\|-------- 0 \| 5.126 \| 1.786 1 \|2.698 \|1.693 After the change: trial \| trainingTime \| testTime -------\|--------\|-------- 0 \|4.692 \|0.554 1 \|2.307 \|0.525 This could partially fix the performance regression during test. Author: Davies Liu <davies@databricks.com> Closes #3420 from davies/ser2 and squashes the following commits: 0e1e6f3 [Davies Liu] fix tests 426f5db [Davies Liu] impove toArray() 44707ec [Davies Liu] add name for ISO-8859-1 fa7d791 [Davies Liu] address comments 1cfb137 [Davies Liu] handle zero sparse vector 2548ee2 [Davies Liu] fix tests 9e6389d [Davies Liu] bugfix 470f702 [Davies Liu] speed up DenseMatrix f0d3c40 [Davies Liu] speedup SparseVector ef6ce70 [Davies Liu] speed up dense vector	2014-11-24 16:37:14 -08:00
DB Tsai	b5d17ef10e	[SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector Previously, we were using Breeze's activeIterator to access the non-zero elements in dense/sparse vector. Due to the overhead, we switched back to native `while loop` in #SPARK-4129. However, #SPARK-4129 requires de-reference the dv.values/sv.values in each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer, we're using Breeze's dense vector to store the partial stats, and this is very expensive compared with using primitive scala array. In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse vector operation which makes codebase easier to maintain. Breeze dense vector is replaced by primitive array to reduce the overhead further. Benchmarking with mnist8m dataset on single JVM with first 200 samples loaded in memory, and repeating 5000 times. Before change: Sparse Vector - 30.02 Dense Vector - 38.27 With this PR: Sparse Vector - 6.29 Dense Vector - 11.72 Author: DB Tsai <dbtsai@alpinenow.com> Closes #3288 from dbtsai/activeIterator and squashes the following commits: 844b0e6 [DB Tsai] formating 03dd693 [DB Tsai] futher performance tunning. 1907ae1 [DB Tsai] address feedback 98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation. c0cbd5a [DB Tsai] fix a bug 6441f92 [DB Tsai] Finished SPARK-4431	2014-11-21 18:15:07 -08:00
Davies Liu	ce95bd8e13	[SPARK-4531] [MLlib] cache serialized java object The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step. This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster. Author: Davies Liu <davies@databricks.com> Closes #3397 from davies/cache and squashes the following commits: 7f6e6ce [Davies Liu] Update -> Updater 4b52edd [Davies Liu] using named argument 63b984e [Davies Liu] fix 7da0332 [Davies Liu] add unpersist() dff33e1 [Davies Liu] address comments c2bdfc2 [Davies Liu] refactor d572f00 [Davies Liu] Merge branch 'master' into cache f1063e1 [Davies Liu] cache serialized java object	2014-11-21 15:02:31 -08:00
Davies Liu	1c53a5db99	[SPARK-4439] [MLlib] add python api for random forest ``` class RandomForestModel \| A model trained by RandomForest \| \| numTrees(self) \| Get number of trees in forest. \| \| predict(self, x) \| Predict values for a single data point or an RDD of points using the model trained. \| \| toDebugString(self) \| Full model \| \| totalNumNodes(self) \| Get total number of nodes, summed over all trees in the forest. \| class RandomForest \| trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None): \| Method to train a decision tree model for binary or multiclass classification. \| \| :param data: Training dataset: RDD of LabeledPoint. \| Labels should take values {0, 1, ..., numClasses-1}. \| :param numClassesForClassification: number of classes for classification. \| :param categoricalFeaturesInfo: Map storing arity of categorical features. \| E.g., an entry (n -> k) indicates that feature n is categorical \| with k categories indexed from 0: {0, 1, ..., k-1}. \| :param numTrees: Number of trees in the random forest. \| :param featureSubsetStrategy: Number of features to consider for splits at each node. \| Supported: "auto" (default), "all", "sqrt", "log2", "onethird". \| If "auto" is set, this parameter is set based on numTrees: \| if numTrees == 1, set to "all"; \| if numTrees > 1 (forest) set to "sqrt". \| :param impurity: Criterion used for information gain calculation. \| Supported values: "gini" (recommended) or "entropy". \| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means \| 1 internal node + 2 leaf nodes. (default: 4) \| :param maxBins: maximum number of bins used for splitting features (default: 100) \| :param seed: Random seed for bootstrapping and choosing feature subsets. \| :return: RandomForestModel that can be used for prediction \| \| trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None): \| Method to train a decision tree model for regression. \| \| :param data: Training dataset: RDD of LabeledPoint. \| Labels are real numbers. \| :param categoricalFeaturesInfo: Map storing arity of categorical features. \| E.g., an entry (n -> k) indicates that feature n is categorical \| with k categories indexed from 0: {0, 1, ..., k-1}. \| :param numTrees: Number of trees in the random forest. \| :param featureSubsetStrategy: Number of features to consider for splits at each node. \| Supported: "auto" (default), "all", "sqrt", "log2", "onethird". \| If "auto" is set, this parameter is set based on numTrees: \| if numTrees == 1, set to "all"; \| if numTrees > 1 (forest) set to "onethird". \| :param impurity: Criterion used for information gain calculation. \| Supported values: "variance". \| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means \| 1 internal node + 2 leaf nodes.(default: 4) \| :param maxBins: maximum number of bins used for splitting features (default: 100) \| :param seed: Random seed for bootstrapping and choosing feature subsets. \| :return: RandomForestModel that can be used for prediction \| ``` Author: Davies Liu <davies@databricks.com> Closes #3320 from davies/forest and squashes the following commits: 8003dfc [Davies Liu] reorder 53cf510 [Davies Liu] fix docs 4ca593d [Davies Liu] fix docs e0df852 [Davies Liu] fix docs 0431746 [Davies Liu] rebased 2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest 885abee [Davies Liu] address comments dae7fc0 [Davies Liu] address comments 89a000f [Davies Liu] fix docs 565d476 [Davies Liu] add python api for random forest	2014-11-20 15:31:28 -08:00
Xiangrui Meng	15cacc8124	[SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent. 1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly. 1. GradientBoosting -> GradientBoostedTrees 1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy 1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.) 1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train` 1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method. 1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting. 1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError 1. doc updates manishamde jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #3374 from mengxr/SPARK-4486 and squashes the following commits: 7097251 [Xiangrui Meng] address joseph's comments 98dea09 [Xiangrui Meng] address manish's comments 4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy ea4c467 [Xiangrui Meng] fix unit tests 751da4e [Xiangrui Meng] rename class method train -> run 19030a5 [Xiangrui Meng] update boosting public APIs	2014-11-20 00:48:59 -08:00
Marcelo Vanzin	397d3aae5b	Bumping version to 1.3.0-SNAPSHOT. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3277 from vanzin/version-1.3 and squashes the following commits: 7c3c396 [Marcelo Vanzin] Added temp repo to sbt build. 5f404ff [Marcelo Vanzin] Add another exclusion. 19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo. 3c8d705 [Marcelo Vanzin] Workaround for MIMA checks. e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.	2014-11-18 21:24:18 -08:00
Davies Liu	d2e29516f2	[SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS ``` class LogisticRegressionWithLBFGS \| train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False) \| Train a logistic regression model on the given data. \| \| :param data: The training data, an RDD of LabeledPoint. \| :param iterations: The number of iterations (default: 100). \| :param initialWeights: The initial weights (default: None). \| :param regParam: The regularizer parameter (default: 0.01). \| :param regType: The type of regularizer used for training \| our model. \| :Allowed values: \| - "l1" for using L1 regularization \| - "l2" for using L2 regularization \| - None for no regularization \| (default: "l2") \| :param intercept: Boolean parameter which indicates the use \| or not of the augmented representation for \| training data (i.e. whether bias features \| are activated or not). \| :param corrections: The number of corrections used in the LBFGS update (default: 10). \| :param tolerance: The convergence tolerance of iterations for L-BFGS (default: 1e-4). \| \| >>> data = [ \| ... LabeledPoint(0.0, [0.0, 1.0]), \| ... LabeledPoint(1.0, [1.0, 0.0]), \| ... ] \| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data)) \| >>> lrm.predict([1.0, 0.0]) \| 1 \| >>> lrm.predict([0.0, 1.0]) \| 0 \| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect() \| [1, 0] ``` Author: Davies Liu <davies@databricks.com> Closes #3307 from davies/lbfgs and squashes the following commits: 34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs 5a945a6 [Davies Liu] address comments 941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs 03e5543 [Davies Liu] add it to docs ed2f9a8 [Davies Liu] add regType 76cd1b6 [Davies Liu] reorder arguments 4429a74 [Davies Liu] Update classification.py 9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS	2014-11-18 15:57:33 -08:00
Davies Liu	8fbf72b790	[SPARK-4435] [MLlib] [PySpark] improve classification This PR add setThrehold() and clearThreshold() for LogisticRegressionModel and SVMModel, also support RDD of vector in LogisticRegressionModel.predict(), SVNModel.predict() and NaiveBayes.predict() Author: Davies Liu <davies@databricks.com> Closes #3305 from davies/setThreshold and squashes the following commits: d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold e4acd76 [Davies Liu] address comments 2231a5f [Davies Liu] bugfix 7bd9009 [Davies Liu] address comments 0b0a8a7 [Davies Liu] address comments c1e5573 [Davies Liu] improve classification	2014-11-18 10:11:13 -08:00
Felix Maximilian Möller	cedc3b5aa4	ALS implicit: added missing parameter alpha in doc string Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de> Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits: 43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method. 7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.	2014-11-18 10:08:24 -08:00
GuoQiang Li	5168c6ca9f	[SPARK-4422][MLLIB]In some cases, Vectors.fromBreeze get wrong results. cc mengxr Author: GuoQiang Li <witgo@qq.com> Closes #3281 from witgo/SPARK-4422 and squashes the following commits: 5f1fa5e [GuoQiang Li] import order 50783bd [GuoQiang Li] review commits 7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results.	2014-11-16 21:31:51 -08:00
Xiangrui Meng	32218307ed	[SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy. In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default. davies atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #3232 from mengxr/SPARK-4372 and squashes the following commits: 9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments d3ba096 [Xiangrui Meng] change 'none' back to None 1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM	2014-11-13 13:54:16 -08:00
Xiangrui Meng	ca26a212fd	[SPARK-4378][MLLIB] make ALS more Java-friendly Add Java-friendly version of `run` and `predict`, and use bulk prediction in Java unit tests. The user guide update will come later (though we may not save many lines of code there). srowen Author: Xiangrui Meng <meng@databricks.com> Closes #3240 from mengxr/SPARK-4378 and squashes the following commits: 6581503 [Xiangrui Meng] check number of predictions 6c8bbd1 [Xiangrui Meng] make ALS more Java-friendly	2014-11-13 11:42:27 -08:00
Andrew Bullen	484fecbf14	[SPARK-4256] Make Binary Evaluation Metrics functions defined in cases where there ar... ...e 0 positive or 0 negative examples. Author: Andrew Bullen <andrew.bullen@workday.com> Closes #3118 from abull/master and squashes the following commits: c2bf2b1 [Andrew Bullen] [SPARK-4256] Update Code formatting for BinaryClassificationMetricsSpec 36b0533 [Andrew Bullen] [SYMAN-4256] Extract BinaryClassificationMetricsSuite assertions into private method 4d2f79a [Andrew Bullen] [SPARK-4256] Refactor classification metrics tests - extract comparison functions in test f411e70 [Andrew Bullen] [SPARK-4256] Define precision as 1.0 when there are no positive examples; update code formatting per pull request comments d9a09ef [Andrew Bullen] Make Binary Evaluation Metrics functions defined in cases where there are 0 positive or 0 negative examples.	2014-11-12 22:14:44 -08:00
Xiangrui Meng	23f5bdf06a	[SPARK-4373][MLLIB] fix MLlib maven tests We want to make sure there is at most one spark context inside the same jvm. JoshRosen Author: Xiangrui Meng <meng@databricks.com> Closes #3235 from mengxr/SPARK-4373 and squashes the following commits: 6574b69 [Xiangrui Meng] rename LocalSparkContext to MLlibTestSparkContext 913d48d [Xiangrui Meng] make sure there is at most one spark context inside the same jvm	2014-11-12 18:15:14 -08:00
Davies Liu	bd86118c4e	[SPARK-4369] [MLLib] fix TreeModel.predict() with RDD Fix TreeModel.predict() with RDD, added tests for it. (Also checked that other models don't have this issue) Author: Davies Liu <davies@databricks.com> Closes #3230 from davies/predict and squashes the following commits: 81172aa [Davies Liu] fix predict	2014-11-12 13:56:41 -08:00
Xiangrui Meng	4b736dbab3	[SPARK-3530][MLLIB] pipeline and parameters with examples This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from marmbrus and liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing org.apache.spark.ml This is a new package with new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, takes leverage on Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline: 1. Transformer, which transforms a dataset into another 2. Estimator, which fits models to data, where models are transformers 3. Evaluator, which evaluates model output and returns a scalar metric 4. Pipeline, a simple pipeline that consists of transformers and estimators Parameters could be supplied at fit/transform or embedded with components. 1. Param: a strong-typed parameter key with self-contained doc 2. ParamMap: a param -> value map 3. Params: trait for components with parameters For any component that implements `Params`, user can easily check the doc by calling `explainParams`: ~~~ > val lr = new LogisticRegression > lr.explainParams maxIter: max number of iterations (default: 100) regParam: regularization constant (default: 0.1) labelCol: label column name (default: label) featuresCol: features column name (default: features) ~~~ or user can check individual param: ~~~ > lr.maxIter maxIter: max number of iterations (default: 100) ~~~ Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples: 1. run a simple logistic regression job ~~~ val lr = new LogisticRegression() .setMaxIter(10) .setRegParam(1.0) val model = lr.fit(dataset) model.transform(dataset, model.threshold -> 0.8) // overwrite threshold .select('label, 'score, 'prediction).collect() .foreach(println) ~~~ 2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric ~~~ val lr = new LogisticRegression val lrParamMaps = new ParamGridBuilder() .addGrid(lr.regParam, Array(0.1, 100.0)) .addGrid(lr.maxIter, Array(0, 5)) .build() val eval = new BinaryClassificationEvaluator val cv = new CrossValidator() .setEstimator(lr) .setEstimatorParamMaps(lrParamMaps) .setEvaluator(eval) .setNumFolds(3) val bestModel = cv.fit(dataset) ~~~ 3. run a pipeline that consists of a standard scaler and a logistic regression component ~~~ val scaler = new StandardScaler() .setInputCol("features") .setOutputCol("scaledFeatures") val lr = new LogisticRegression() .setFeaturesCol(scaler.getOutputCol) val pipeline = new Pipeline() .setStages(Array(scaler, lr)) val model = pipeline.fit(dataset) val predictions = model.transform(dataset) .select('label, 'score, 'prediction) .collect() .foreach(println) ~~~ 4. a simple text classification pipeline, which recognizes "spark": ~~~ val training = sparkContext.parallelize(Seq( LabeledDocument(0L, "a b c d e spark", 1.0), LabeledDocument(1L, "b d", 0.0), LabeledDocument(2L, "spark f g h", 1.0), LabeledDocument(3L, "hadoop mapreduce", 0.0))) val tokenizer = new Tokenizer() .setInputCol("text") .setOutputCol("words") val hashingTF = new HashingTF() .setInputCol(tokenizer.getOutputCol) .setOutputCol("features") val lr = new LogisticRegression() .setMaxIter(10) val pipeline = new Pipeline() .setStages(Array(tokenizer, hashingTF, lr)) val model = pipeline.fit(training) val test = sparkContext.parallelize(Seq( Document(4L, "spark i j k"), Document(5L, "l m"), Document(6L, "mapreduce spark"), Document(7L, "apache hadoop"))) model.transform(test) .select('id, 'text, 'prediction, 'score) .collect() .foreach(println) ~~~ Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`. What are missing now and will be added soon: 1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~ 2. ~~Java examples.~~ 3. ~~Store training parameters in trained models.~~ 4. (later) Serialization and Python API. Author: Xiangrui Meng <meng@databricks.com> Closes #3099 from mengxr/SPARK-3530 and squashes the following commits: 2cc93fd [Xiangrui Meng] hide APIs as much as I can 34319ba [Xiangrui Meng] use local instead local[2] for unit tests 2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema c9daab4 [Xiangrui Meng] remove mockito version 1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext 6ffc389 [Xiangrui Meng] try to fix unit test a59d8b7 [Xiangrui Meng] doc updates 977fd9d [Xiangrui Meng] add scala ml package object 6d97fe6 [Xiangrui Meng] add AlphaComponent annotation 731f0e4 [Xiangrui Meng] update package doc 0435076 [Xiangrui Meng] remove ;this from setters fa21d9b [Xiangrui Meng] update extends indentation f1091b3 [Xiangrui Meng] typo 228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics f51cd27 [Xiangrui Meng] rename default to defaultValue b3be094 [Xiangrui Meng] refactor schema transform in lr 8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing 51f1c06 [Xiangrui Meng] remove leftover code in Transformer 494b632 [Xiangrui Meng] compure score once ad678e9 [Xiangrui Meng] more doc for Transformer 4306ed4 [Xiangrui Meng] org imports in text pipeline 6e7c1c7 [Xiangrui Meng] update pipeline 4f9e34f [Xiangrui Meng] more doc for pipeline aa5dbd4 [Xiangrui Meng] fix typo 11be383 [Xiangrui Meng] fix unit tests 3df7952 [Xiangrui Meng] clean up 986593e [Xiangrui Meng] re-org java test suites 2b11211 [Xiangrui Meng] remove external data deps 9fd4933 [Xiangrui Meng] add unit test for pipeline 2a0df46 [Xiangrui Meng] update tests 2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info 27582a4 [Xiangrui Meng] doc changes 73a000b [Xiangrui Meng] add schema transformation layer 6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait 80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer 62ca2bb [Xiangrui Meng] check param parent in set/get 1622349 [Xiangrui Meng] add getModel to PipelineModel a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite e246f29 [Xiangrui Meng] re-org: 7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline b95c408 [Xiangrui Meng] remove implicits add unit tests to params bab3e5b [Xiangrui Meng] update params fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530 6e86d98 [Xiangrui Meng] some code clean-up 2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip] fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform 3f810cd [Xiangrui Meng] use multi-model training api in cv 5b8f413 [Xiangrui Meng] rename model to modelParams 9d2d35d [Xiangrui Meng] test varargs and chain model params f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530 1ef26e0 [Xiangrui Meng] specialize methods/types for Java df293ed [Xiangrui Meng] switch to setter/getter 376db0a [Xiangrui Meng] pipeline and parameters	2014-11-12 10:38:57 -08:00
Xiangrui Meng	84324fbcb9	[SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero See inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test. Author: Xiangrui Meng <meng@databricks.com> Closes #3220 from mengxr/SPARK-4355 and squashes the following commits: 5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up	2014-11-12 01:50:11 -08:00
Manish Amde	2ef016b130	[MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time. Before: [info] GradientBoostingSuite: [info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds) [info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds) [info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds) After: [info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds) [info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds) [info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds) cc: mengxr, jkbradley Author: Manish Amde <manish9ue@gmail.com> Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits: 8994552 [Manish Amde] reducing gbt test run times	2014-11-11 22:47:53 -08:00
Michelangelo D'Agostino	7e9d975676	[MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API. This pull request resolves that. Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com> Closes #3095 from mdagost/python_nmf and squashes the following commits: a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI. Remove the new static methods I added. Set seed in tests. Change ratings to ratingsRDD in both train and trainImplicit for consistency. 7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places. 3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list. bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null. cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes. Also made the python ALS tests more exact. a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.	2014-11-07 22:53:01 -08:00
Joseph K. Bradley	5b3b6f6f5f	[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java ### Summary * Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types. * Added Scala and Java examples for GradientBoostedTrees * small cleanups and fixes ### Details GradientBoosting bug fixes (“bug” = bad default options) * Force boostingStrategy.weakLearnerParams.algo = Regression * Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance * Only persist data if not yet persisted (since it causes an error if persisted twice) BoostingStrategy * numEstimators: renamed to numIterations * removed subsamplingRate (duplicated by Strategy) * removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type) * Changed algo to var (not val) and added BeanProperty, with overload taking String argument * Added assertValid() method * Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy Strategy (for DecisionTree) * Changed algo to var (not val) and added BeanProperty, with overload taking String argument * Added setCategoricalFeaturesInfo method taking Java Map. * Cleaned up assertValid * Changed val’s to def’s since parameters can now be changed. CC: manishamde mengxr codedeft Author: Joseph K. Bradley <joseph@databricks.com> Closes #3094 from jkbradley/gbt-api and squashes the following commits: 7a27e22 [Joseph K. Bradley] scalastyle fix 52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api e9b8410 [Joseph K. Bradley] Summary of changes	2014-11-05 10:33:13 -08:00
Davies Liu	c8abddc516	[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API ``` pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None) :: Experimental :: If `observed` is Vector, conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution, or againt the uniform distribution (by default), with each category having an expected frequency of `1 / len(observed)`. (Note: `observed` cannot contain negative values) If `observed` is matrix, conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. If `observed` is an RDD of LabeledPoint, conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical. :param observed: it could be a vector containing the observed categorical counts/relative frequencies, or the contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. :param expected: Vector containing the expected categorical counts/relative frequencies. `expected` is rescaled if the `expected` sum differs from the `observed` sum. :return: ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis. ``` Author: Davies Liu <davies@databricks.com> Closes #3091 from davies/his and squashes the following commits: 145d16c [Davies Liu] address comments 0ab0764 [Davies Liu] fix float 5097d54 [Davies Liu] add Hypothesis test Python API	2014-11-04 21:35:52 -08:00
Niklas Wilcke	f90ad5d426	[Spark-4060] [MLlib] exposing special rdd functions to the public Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de> Closes #2907 from numbnut/master and squashes the following commits: 7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907	2014-11-04 09:57:03 -08:00
Davies Liu	e4f42631a6	[SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1. Author: Davies Liu <davies@databricks.com> This patch had conflicts when merged, resolved by Committer: Josh Rosen <joshrosen@databricks.com> Closes #2920 from davies/fix_autobatch and squashes the following commits: e544ef9 [Davies Liu] revert unrelated change 6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 1d557fc [Davies Liu] fix tests 8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 76abdce [Davies Liu] clean up 53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch b4292ce [Davies Liu] fix bug in master d79744c [Davies Liu] recover hive tests be37ece [Davies Liu] refactor eb3938d [Davies Liu] refactor serializer in scala 8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.	2014-11-03 23:56:14 -08:00
Xiangrui Meng	1a9c6cddad	[SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley. ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~ marmbrus jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #3070 from mengxr/SPARK-3573 and squashes the following commits: 3a0b6e5 [Xiangrui Meng] organize imports 236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples	2014-11-03 22:29:48 -08:00
Xiangrui Meng	c5912ecc7b	[FIX][MLLIB] fix seed in BaggedPointSuite Saw Jenkins test failures due to random seeds. jkbradley manishamde Author: Xiangrui Meng <meng@databricks.com> Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following commits: f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite	2014-11-03 18:50:37 -08:00
Sung Chung	56f2c61cde	[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training deci... ...sion trees. jkbradley mengxr chouqin Please review this. Author: Sung Chung <schung@alpinenow.com> Closes #2868 from codedeft/SPARK-3161 and squashes the following commits: 5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.	2014-11-01 16:58:26 -07:00
Xiangrui Meng	d8176b1c2f	[SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading In #2928 , we shade commons-math3 to prevent future conflicts with hadoop. It caused problems with our Jenkins master build with maven. Some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while mllib test code still compiles with core and the untouched math3 classes. This PR sets commons-math3 version based on hadoop profiles. pwendell JoshRosen srowen Author: Xiangrui Meng <meng@databricks.com> Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits: 580f6d9 [Xiangrui Meng] replace tab by spaces 7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts d3353d9 [Xiangrui Meng] do not shade commons-math3 b4180dc [Xiangrui Meng] temp work	2014-11-01 15:21:36 -07:00
freeman	98c556ebbc	Streaming KMeans [MLLIB][SPARK-3254] This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches. The PR includes: - StreamingKMeans algorithm with decay factor settings - Usage example - Additions to documentation clustering page - Unit tests of basic behavior and decay behaviors tdas mengxr rezazadeh Author: freeman <the.freeman.lab@gmail.com> Author: Jeremy Freeman <the.freeman.lab@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits: b2e5b4a [freeman] Fixes to docs / examples 078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254 2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters 0411bf5 [freeman] Change decay parameterization 9f7aea9 [freeman] Style fixes 374a706 [freeman] Formatting ad9bdc2 [freeman] Use labeled points and predictOnValues in examples 77dbd3f [freeman] Make initialization check an assertion 9cfc301 [freeman] Make random seed an argument 44050a9 [freeman] Simpler constructor c7050d5 [freeman] Fix spacing 2899623 [freeman] Use pattern matching for clarity a4a316b [freeman] Use collect 1472ec5 [freeman] Doc formatting ea22ec8 [freeman] Fix imports 2086bdc [freeman] Log cluster center updates ea9877c [freeman] More documentation 9facbe3 [freeman] Bug fix 5db7074 [freeman] Example usage for StreamingKMeans f33684b [freeman] Add explanation and example to docs b5b5f8d [freeman] Add better documentation a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans 9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans b93350f [freeman] Streaming KMeans with decay	2014-10-31 22:30:12 -07:00
Manish Amde	8602195510	[MLLIB] SPARK-1547: Add Gradient Boosting to MLlib Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to decision tree optimizations and random forests work. Ideally, boosting algorithms should work with any base learners. This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default. Here is the task list: - [x] Gradient boosting support - [x] Pluggable loss functions - [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest. - [x] Binary classification support - [x] Support configurable checkpointing – This approach will avoid long lineage chains. - [x] Create classification and regression APIs - [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting. - [x] Unit Tests Future work: + Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function. + BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed. cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin Author: Manish Amde <manish9ue@gmail.com> Author: manishamde <manish9ue@gmail.com> Closes #2607 from manishamde/gbt and squashes the following commits: 991c7b5 [Manish Amde] public api ff2a796 [Manish Amde] addressing comments b4c1318 [Manish Amde] removing spaces 8476b6b [Manish Amde] fixing line length 0183cb9 [Manish Amde] fixed naming and formatting issues 1c40c33 [Manish Amde] add newline, removed spaces e33ab61 [Manish Amde] minor comment eadbf09 [Manish Amde] parameter renaming 035a2ed [Manish Amde] jkbradley formatting suggestions 9f7359d [Manish Amde] simplified gbt logic and added more tests 49ba107 [Manish Amde] merged from master eff21fe [Manish Amde] Added gradient boosting tests 3fd0528 [Manish Amde] moved helper methods to new class a32a5ab [Manish Amde] added test for subsampling without replacement 781542a [Manish Amde] added support for fractional subsampling with replacement 3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to it's own test suite 0e81906 [Manish Amde] improving caching unpersisting logic d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class fee06d3 [Manish Amde] added weighted ensemble model 1b01943 [Manish Amde] add weights for base learners 9bc6e74 [Manish Amde] adding random seed as parameter d2c8323 [Manish Amde] Merge branch 'master' into gbt 2ae97b7 [Manish Amde] added documentation for the loss classes 9366b8f [Manish Amde] minor: using numTrees instead of trees.size 3b43896 [Manish Amde] added learning rate for prediction 9b2e35e [Manish Amde] Merge branch 'master' into gbt 6a11c02 [manishamde] fixing formatting 823691b [Manish Amde] fixing RF test 1f47941 [Manish Amde] changing access modifier 5b67102 [Manish Amde] shortened parameter list 5ab3796 [Manish Amde] minor reformatting 9155a9d [Manish Amde] consolidated boosting configuration and added public API 631baea [Manish Amde] Merge branch 'master' into gbt 2cb1258 [Manish Amde] public API support 3b8ffc0 [Manish Amde] added documentation 8e10c63 [Manish Amde] modified unpersist strategy f62bc48 [Manish Amde] added unpersist bdca43a [Manish Amde] added timing parameters 2fbc9c7 [Manish Amde] fixing binomial classification prediction 6dd4dd8 [Manish Amde] added support for log loss 9af0231 [Manish Amde] classification attempt 62cc000 [Manish Amde] basic checkpointing 4784091 [Manish Amde] formatting 78ed452 [Manish Amde] added newline and fixed if statement 3973dd1 [Manish Amde] minor indicating subsample is double during comparison aa8fae7 [Manish Amde] minor refactoring 1a8031c [Manish Amde] sampling with replacement f1c9ef7 [Manish Amde] Merge branch 'master' into gbt cdceeef [Manish Amde] added documentation 6251fd5 [Manish Amde] modified method name 5538521 [Manish Amde] disable checkpointing for now 0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches	2014-10-31 18:57:55 -07:00
Alexander Ulanov	62d01d255c	[MLLIB] SPARK-2329 Add multi-label evaluation metrics Implementation of various multi-label classification measures, including: Hamming-loss, strict and default Accuracy, macro-averaged Precision, Recall and F1-measure based on documents and labels, micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329 Multi-class measures are currently in the following pull request: https://github.com/apache/spark/pull/1155 Author: Alexander Ulanov <nashb@yandex.ru> Author: avulanov <nashb@yandex.ru> Closes #1270 from avulanov/multilabelmetrics and squashes the following commits: fc8175e [Alexander Ulanov] Merge with previous updates 43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array 517a594 [avulanov] Addressing reviewers comments: Scala style cf4222bc [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels 1843f73 [Alexander Ulanov] Scala style fix 79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation 40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests ad62df0 [Alexander Ulanov] Comments and scala style check 154164b [Alexander Ulanov] Multilabel evaluation metics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class	2014-10-31 18:31:03 -07:00
Erik Erlandson	ad3bd0dff8	[SPARK-3250] Implement Gap Sampling optimization for random sampling More efficient sampling, based on Gap Sampling optimization: http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/ Author: Erik Erlandson <eerlands@redhat.com> Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits: 72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling	2014-10-30 22:30:52 -07:00
Davies Liu	872fc669b4	[SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much. After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API. cc mengxr Author: Davies Liu <davies@databricks.com> Closes #2995 from davies/cleanup and squashes the following commits: 8fa6ec6 [Davies Liu] address comments 16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup 43743e5 [Davies Liu] bugfix 731331f [Davies Liu] simplify serialization in MLlib Python API	2014-10-30 22:25:18 -07:00
Yanbo Liang	d9327192ee	SPARK-4111 [MLlib] add regression metrics Add RegressionMetrics.scala as regression metrics used for evaluation and corresponding test case RegressionMetricsSuite.scala. Author: Yanbo Liang <yanbohappy@gmail.com> Author: liangyanbo <liangyanbo@meituan.com> Closes #2978 from yanbohappy/regression_metrics and squashes the following commits: 730d0a9 [Yanbo Liang] more clearly annotation 3d0bec1 [Yanbo Liang] rename and keep code style a8ad3e3 [Yanbo Liang] simplify code for keeping style d454909 [Yanbo Liang] rename parameter and function names, delete unused columns, add reference 2e56282 [liangyanbo] rename r2_score() and remove unused column 43bb12b [liangyanbo] add regression metrics	2014-10-30 12:00:56 -07:00
Joseph E. Gonzalez	c7ad085208	[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace This simple patch filters out extra whitespace entries. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Author: Joey <joseph.e.gonzalez@gmail.com> Closes #2996 from jegonzal/loadLibSVM and squashes the following commits: e0227ab [Joey] improving readability e028e84 [Joseph E. Gonzalez] fixing whitespace bug in loadLibSVMFile when parsing libSVM files	2014-10-30 00:05:57 -07:00
DB Tsai	51ce997355	[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonZero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, native while loop is used for both DenseVector and SparseVector. The benchmark result with 20 executors using mnist8m dataset: Before: DenseVector: 48.2 seconds SparseVector: 16.3 seconds After: DenseVector: 17.8 seconds SparseVector: 11.2 seconds Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in mllib library will be significant with this PR. Author: DB Tsai <dbtsai@alpinenow.com> Closes #2992 from dbtsai/SPARK-4129 and squashes the following commits: b99db6c [DB Tsai] fixed java.lang.ArrayIndexOutOfBoundsException 2b5e882 [DB Tsai] small refactoring ebe3e74 [DB Tsai] First commit	2014-10-29 10:14:53 -07:00

1 2 3 4 5 ...

558 commits