ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Sean Owen	7da6748519	[SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7 Update JPMML pmml-model to 1.2.7 Author: Sean Owen <sowen@cloudera.com> Closes #9972 from srowen/SPARK-11988.	2015-12-05 15:52:52 +00:00
Antonio Murgia	e9c9ae22b9	[SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Author: Antonio Murgia <antonio.murgia2@studio.unibo.it> Closes #9989 from tmnd1991/SPARK-11932.	2015-12-05 15:42:02 +00:00
Yuhao Yang	ee94b70ce5	[SPARK-12096][MLLIB] remove the old constraint in word2vec jira: https://issues.apache.org/jira/browse/SPARK-12096 word2vec now can handle much bigger vocabulary. The old constraint vocabSize.toLong * vectorSize < Int.max / 8 should be removed. The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue). I tested with vocabSize over 18M and vectorSize = 100. srowen jkbradley Sorry to miss this in last PR. I was reminded today. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10103 from hhbyyh/w2vCapacity.	2015-12-05 15:27:31 +00:00
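The two constraints in the commit above can be checked with plain arithmetic. The following is a minimal Python sketch, not Spark code; the function names are hypothetical, and the exact "max array length" is an assumption (the commit only says it is a little less than Int.MaxValue).

```python
# Plain-Python arithmetic sketch of the Word2Vec size constraints; not Spark code.
INT_MAX = 2**31 - 1  # value of Scala's Int.MaxValue

# Assumption: the commit only says "a little less than Int.MaxValue".
# The margin of 8 is illustrative, not taken from the source.
ASSUMED_MAX_ARRAY_LENGTH = INT_MAX - 8


def fits_old_constraint(vocab_size, vector_size):
    """Old rule: vocabSize * vectorSize < Int.max / 8 (integer division)."""
    return vocab_size * vector_size < INT_MAX // 8


def fits_new_constraint(vocab_size, vector_size):
    """New rule: vocabSize * vectorSize < max array length."""
    return vocab_size * vector_size < ASSUMED_MAX_ARRAY_LENGTH


# Roughly the configuration the commit author reports testing.
vocab_size, vector_size = 18_000_000, 100
old_ok = fits_old_constraint(vocab_size, vector_size)
new_ok = fits_new_constraint(vocab_size, vector_size)
print(old_ok, new_ok)
```

Under these assumptions, 18M words with 100-dimensional vectors fail the old check and pass the new one, which matches the motivation given in the commit message.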
Josh Rosen	b7204e1d41	[SPARK-12112][BUILD] Upgrade to SBT 0.13.9 We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.	2015-12-05 08:15:30 +08:00
Dmitry Erastov	d0d8222778	[SPARK-6990][BUILD] Add Java linting script; fix minor warnings This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles): > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1 Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. 
cr: JoshRosen, pwendell > I state that the contribution is my original work, and I license the work to the project under the project's open source license. Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.	2015-12-04 12:03:45 -08:00
Xiangrui Meng	9bb695b7a8	[SPARK-12000] do not specify arg types when reference a method in ScalaDoc This fixes SPARK-12000, verified on my local with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations. cc: JoshRosen jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #10114 from mengxr/SPARK-12000.2.	2015-12-02 17:19:31 -08:00
Yu ISHIKAWA	de07d06abe	[SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tuning cc mengxr noel-smith I worked on this issue based on https://github.com/apache/spark/pull/8729. ehsanmok thank you for your contribution! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #9338 from yu-iskw/JIRA-10266.	2015-12-02 14:15:54 -08:00
Cheng Lian	69dbe6b40d	[SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues This PR backports PR #10039 to master Author: Cheng Lian <lian@databricks.com> Closes #10063 from liancheng/spark-12046.doc-fix.master.	2015-12-01 10:21:31 -08:00
Yuhao Yang	a0af0e351e	[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization. Using broadcast can greatly improve the performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can help, 1. decrease the worker memory consumption by 45%. 2. decrease running time by 40%. This will also help extend the upper limit for Word2Vec. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9878 from hhbyyh/w2vBC.	2015-12-01 09:26:58 +00:00
Yuhao Yang	52bc25c8e2	[SPARK-11847][ML] Model export/import for spark.ml: LDA Add read/write support to LDA, similar to ALS. save/load for ml.LocalLDAModel is done. For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send update after some test. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9894 from hhbyyh/ldaMLsave.	2015-11-24 09:56:17 -08:00
Joseph K. Bradley	9e24ba667e	[SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col Doc for 1.6 that the summaries mostly ignore the weight column. To be corrected for 1.7 CC: mengxr thunterdb Author: Joseph K. Bradley <joseph@databricks.com> Closes #9927 from jkbradley/linregsummary-doc.	2015-11-24 09:54:55 -08:00
BenFradet	4be360d4ee	[SPARK-11902][ML] Unhandled case in VectorAssembler#transform There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType or VectorUDT. So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType". This PR aims to fix this, throwing a SparkException when dealing with an unknown column type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #9885 from BenFradet/SPARK-11902.	2015-11-22 22:05:01 -08:00
Yanbo Liang	d9cf9c21fc	[SPARK-11912][ML] ml.feature.PCA minor refactor Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is params and we should save it under ```metadata/``` rather than both under ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` but construct ```mllib.feature.PCAModel``` inside ```transform```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9897 from yanboliang/spark-11912.	2015-11-22 21:56:07 -08:00
Joseph K. Bradley	a6fda0bfc1	[SPARK-6791][ML] Add read/write for CrossValidator and Evaluators I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.) Added read/write for all 3 Evaluators as well. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9848 from jkbradley/cv-io.	2015-11-22 21:48:48 -08:00
Yanbo Liang	9ace2e5c8d	[SPARK-11852][ML] StandardScaler minor refactor ```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9839 from yanboliang/standardScaler-refactor.	2015-11-20 09:55:53 -08:00
Xusen Yin	3e1d120ced	[SPARK-11867] Add save/load for kmeans and naive bayes https://issues.apache.org/jira/browse/SPARK-11867 Author: Xusen Yin <yinxusen@gmail.com> Closes #9849 from yinxusen/SPARK-11867.	2015-11-19 23:43:18 -08:00
Joseph K. Bradley	0fff8eb3e4	[SPARK-11869][ML] Clean up TempDirectory properly in ML tests Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```) I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem. CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting. Author: Joseph K. Bradley <joseph@databricks.com> Closes #9851 from jkbradley/tempdir-cleanup.	2015-11-19 23:42:24 -08:00
Yanbo Liang	3b7f056da8	[SPARK-11829][ML] Add read/write to estimators under ml.feature (II) Add read/write support to the following estimators under spark.ml: * ChiSqSelector * PCA * VectorIndexer * Word2Vec Author: Yanbo Liang <ybliang8@gmail.com> Closes #9838 from yanboliang/spark-11829.	2015-11-19 22:02:17 -08:00
Xusen Yin	4114ce20fb	[SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression https://issues.apache.org/jira/browse/SPARK-11846 mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #9836 from yinxusen/SPARK-11846.	2015-11-19 22:01:02 -08:00
Joseph K. Bradley	d02d5b9295	[SPARK-11842][ML] Small cleanups to existing Readers and Writers Updates: * Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression * Switch from hand-written class names in Readers to using getClass CC: mengxr CC: yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you! Author: Joseph K. Bradley <joseph@databricks.com> Closes #9829 from jkbradley/ml-io-cleanups.	2015-11-18 21:44:01 -08:00
Xiangrui Meng	e99d339206	[SPARK-11839][ML] refactor save/write traits * add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.* * define `DefaultParamsReadable/Writable` and use them to save some code * use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9827 from mengxr/SPARK-11839.	2015-11-18 18:34:01 -08:00
Xiangrui Meng	7e987de177	[SPARK-6787][ML] add read/write to estimators under ml.feature (1) Add read/write support to the following estimators under spark.ml: * CountVectorizer * IDF * MinMaxScaler * StandardScaler (a little awkward because we store some params in spark.mllib model) * StringIndexer Added some necessary method for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9798 from mengxr/SPARK-6787.	2015-11-18 15:47:49 -08:00
Yanbo Liang	e222d75849	[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes This PR includes: * Update SparkR:::glm, SparkR:::summary API docs. * Update SparkR machine learning user guide and example codes to show: * supporting feature interaction in R formula. * summary for gaussian GLM model. * coefficients for binomial GLM model. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9727 from yanboliang/spark-11684.	2015-11-18 13:30:29 -08:00
Yuhao Yang	e391abdf2c	[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits. 1. Performance improvement for less serialization. 2. Increase the capacity of Word2Vec a lot. Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 vocab; the 2 global tables take vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary. Actually there's another possible fix: make a local copy of fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9803 from hhbyyh/w2vVocab.	2015-11-18 13:25:15 -08:00
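The size estimate in the commit above can be worked through numerically. This is a rough Python sketch using only the per-word figures quoted in the message (320 bytes for vocab, vectorSize * 8 for the global tables); the function names are hypothetical, and everything else in the closure is ignored, so the results are upper bounds for illustration only.

```python
# Back-of-the-envelope sketch of the closure-size limit; not Spark code.
INT_MAX = 2**31 - 1  # ByteArrayOutputStream cannot hold more bytes than this

# Per-word cost of the serialized vocab, as quoted in the commit message.
VOCAB_BYTES_PER_WORD = 40 * 2 * 4  # = 320


def global_table_bytes_per_word(vector_size):
    """Per-word cost of the 2 global tables: vectorSize * 8 bytes."""
    return vector_size * 8


def max_vocab(vector_size, serialize_vocab):
    """Largest vocabulary whose estimated closure size stays within INT_MAX."""
    per_word = global_table_bytes_per_word(vector_size)
    if serialize_vocab:
        per_word += VOCAB_BYTES_PER_WORD
    return INT_MAX // per_word


with_vocab = max_vocab(20, serialize_vocab=True)      # 480 bytes per word
without_vocab = max_vocab(20, serialize_vocab=False)  # 160 bytes per word
print(with_vocab, without_vocab)
```

With vectorSize = 20, dropping the vocab from the closure roughly triples the vocabulary that fits under this estimate, which is why the gain is largest when vectorSize is small.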
Joseph K. Bradley	2acdf10b1f	[SPARK-6789][ML] Add Readable, Writable support for spark.ml ALS, ALSModel Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9786 from jkbradley/als-io.	2015-11-18 13:16:31 -08:00
Wenjian Huang	045a4f0458	[SPARK-6790][ML] Add spark.ml LinearRegression import/export This replaces [https://github.com/apache/spark/pull/9656] with updates. fayeshine should be the main author when this PR is committed. CC: mengxr fayeshine Author: Wenjian Huang <nextrush@163.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #9814 from jkbradley/fayeshine-patch-6790.	2015-11-18 13:06:25 -08:00
RoyGaoVLIS	67a5132c21	[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler I have added unit test for ML's StandardScaler By comparing with R's output, please review for me. Thx. Author: RoyGaoVLIS <roygao@zju.edu.cn> Closes #6665 from RoyGao/7013.	2015-11-17 23:00:49 -08:00
Xiangrui Meng	3e9e638023	[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9776 from mengxr/SPARK-11764.	2015-11-17 14:04:49 -08:00
Joseph K. Bradley	6eb7008b7f	[SPARK-11763][ML] Add save,load to LogisticRegression Estimator Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9749 from jkbradley/lr-io-2.	2015-11-17 14:03:49 -08:00
Joseph K. Bradley	d98d1cb000	[SPARK-11769][ML] Add save, load to all basic Transformers This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds: * Bucketizer * DCT * HashingTF * Interaction * NGram * Normalizer * OneHotEncoder * PolynomialExpansion * QuantileDiscretizer * RFormula * SQLTransformer * StopWordsRemover * StringIndexer * Tokenizer * VectorAssembler * VectorSlicer CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9755 from jkbradley/transformer-io.	2015-11-17 12:43:56 -08:00
Xiangrui Meng	21fac54341	[SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9751 from mengxr/SPARK-11766.	2015-11-17 10:17:16 -08:00
Joseph K. Bradley	1c5475f140	[SPARK-11612][ML] Pipeline and PipelineModel persistence Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9674 from jkbradley/pipeline-io.	2015-11-16 17:12:39 -08:00
Xiangrui Meng	64e5551103	[SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite The same as #9694, but for Java test suite. yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9719 from mengxr/SPARK-11672.4.	2015-11-15 13:23:05 -08:00
Xiangrui Meng	bdfbc1dcaf	[MINOR][ML] remove MLlibTestsSparkContext from ImpuritySuite ImpuritySuite doesn't need SparkContext. Author: Xiangrui Meng <meng@databricks.com> Closes #9698 from mengxr/remove-mllib-test-context-in-impurity-suite.	2015-11-13 13:19:04 -08:00
Xiangrui Meng	2d2411faa2	[SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll Still saw some error messages caused by `SQLContext.getOrCreate`: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/ This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9694 from mengxr/SPARK-11672.3.	2015-11-13 13:09:28 -08:00
Yanbo Liang	99693fef0a	[SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include: * Use libSVM data source for all example codes under examples/ml, and remove unused import. * Use libSVM data source for user guides under ml-*** which were omitted by #8697. * Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```. * Code cleanup. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9690 from yanboliang/spark-11723.	2015-11-13 08:43:05 -08:00
Xiangrui Meng	e71c07557c	[SPARK-11672][ML] flaky spark.ml read/write tests We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext` and then `SQLContext.getOrCreate` might use the `SparkContext` from previous test suite and hence causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9677 from mengxr/SPARK-11672.2.	2015-11-12 20:01:13 -08:00
Joseph K. Bradley	dcb896fd8c	[SPARK-11712][ML] Make spark.ml LDAModel be abstract Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases. CC feynmanliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9678 from jkbradley/lda-pipelines-2.	2015-11-12 17:03:19 -08:00
Xiangrui Meng	e2957bc085	[SPARK-11674][ML] add private val after @transient in Word2VecModel This causes compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested compile locally.) JoshRosen Author: Xiangrui Meng <meng@databricks.com> Closes #9644 from mengxr/SPARK-11674.	2015-11-11 21:01:14 -08:00
Xiangrui Meng	1a21be15f6	[SPARK-11672][ML] disable spark.ml read/write tests Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/ I cannot reproduce it on local. So temporarily disable the tests and I will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile. Author: Xiangrui Meng <meng@databricks.com> Closes #9641 from mengxr/SPARK-11672.	2015-11-11 15:41:36 -08:00
Yuming Wang	27524a3a9c	[SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow org.apache.spark.ml.feature.Word2Vec.transform() very slow. we should not read broadcast every sentence. Author: Yuming Wang <q79969786@gmail.com> Author: yuming.wang <q79969786@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #9592 from 979969786/master.	2015-11-11 09:43:26 -08:00
Joseph K. Bradley	6e101d2e9d	[SPARK-6726][ML] Import/export for spark.ml LogisticRegressionModel This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9606 from jkbradley/logreg-io2.	2015-11-10 18:45:48 -08:00
Yu ISHIKAWA	c0e48dfa61	[SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in Python cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9534 from yu-iskw/SPARK-11566.	2015-11-10 16:42:28 -08:00
Joseph K. Bradley	e281b87398	[SPARK-5565][ML] LDA wrapper for Pipelines API This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change: * I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed. Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR. CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9513 from jkbradley/lda-pipelines.	2015-11-10 16:20:10 -08:00
unknown	dba1a62cf1	[SPARK-7316][MLLIB] RDD sliding window with step Implementation of step capability for sliding window function in MLlib's RDD. Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:

Window | Step | Time | Windows produced
------------ | ------------- | ---------- | ----------
128 | 1 | 6.38 | 9999873
128 | 10 | 0.9 | 999988
128 | 100 | 0.41 | 99999
1024 | 1 | 44.67 | 9998977
1024 | 10 | 4.74 | 999898
1024 | 100 | 0.78 | 99990

```
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(1 to 10000000, 10)
rdd.count
val window = 1024
val step = 1
val t = System.nanoTime(); val windows = rdd.sliding(window, step); println(windows.count); println((System.nanoTime() - t) / 1e9)
```

Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net> Author: Alexander Ulanov <nashb@yandex.ru> Author: Xiangrui Meng <meng@databricks.com> Closes #5855 from avulanov/SPARK-7316-sliding.	2015-11-10 14:25:06 -08:00
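The "Windows produced" column in the commit above follows a simple counting rule. The sketch below is plain Python over a local list, not the MLlib RDD implementation; `sliding` and `window_count` are hypothetical helper names, and the assumption is that only complete windows are emitted, which is consistent with the counts in the table.

```python
# Local-list sketch of a sliding window with a step; not the MLlib RDD code.

def sliding(items, window, step=1):
    """Yield every complete window of length `window`, starting every `step` items."""
    for start in range(0, len(items) - window + 1, step):
        yield items[start:start + window]


def window_count(n, window, step):
    """Number of complete windows over n items, without materializing them."""
    if n < window:
        return 0
    return (n - window) // step + 1


# Small example: 10 items, window 4, step 3 gives windows starting at 0, 3, 6.
windows = list(sliding(list(range(1, 11)), window=4, step=3))
print(windows)

# The closed-form count reproduces the table for 10M data points.
print(window_count(10_000_000, 1024, 10))
```

Filtering every Nth window after a step-1 pass yields the same windows, but builds all `n - window + 1` of them first, which is the overhead the commit is avoiding.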
Joseph K. Bradley	18350a5700	[SPARK-11618][ML] Minor refactoring of basic ML import/export Refactoring * separated overwrite and param save logic in DefaultParamsWriter * added sparkVersion to DefaultParamsWriter CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9587 from jkbradley/logreg-io.	2015-11-10 11:36:43 -08:00
Yuhao Yang	61f9c8711c	[SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase jira: https://issues.apache.org/jira/browse/SPARK-11069 quotes from jira: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal: call the Boolean Param "toLowercase" set default to false (so behavior does not change) Actually sklearn converts to lowercase before tokenizing too Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9092 from hhbyyh/tokenLower.	2015-11-09 16:55:23 -08:00
Yu ISHIKAWA	8a2336893a	[SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation and spark.ml APIs. I am going to send further PRs later. https://issues.apache.org/jira/browse/SPARK-6517 - This implementation is based on bisecting K-means clustering. - It derives from freeman-lab's implementation. - The basic idea is not changed from the previous version. (#2906) - However, it is 1000x faster than the previous version through parallel processing. Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen). Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com> Closes #5267 from yu-iskw/new-hierarchical-clustering.	2015-11-09 14:56:36 -08:00
fazlan-nazeem	9b88e1dcad	[SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model The current pmml models generated do not specify the pmml version in its root node. This is a problem when using this pmml model in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2. Author: fazlan-nazeem <fazlann@wso2.com> Closes #9558 from fazlan-nazeem/master.	2015-11-09 08:58:55 -08:00
Yanbo Liang	8c0e1b50e9	[SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like

```Java
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```

Author: Yanbo Liang <ybliang8@gmail.com> Closes #9561 from yanboliang/spark-11494.	2015-11-09 08:56:22 -08:00
Yu ISHIKAWA	2ff0e79a86	[SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python Could jkbradley and davies review it? - Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it. - Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`. [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.	2015-11-06 22:56:29 -08:00
Xiangrui Meng	c447c9d546	[SPARK-11217][ML] save/load for non-meta estimators and transformers This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes: * class name * uid * timestamp * paramMap The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases. ~~~scala instance.save("path") instance.write.context(sqlContext).overwrite().save("path") Instance.load("path") ~~~ The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it also cause other issues if we modify the default params. TODOs: * [x] Java test * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9454 from mengxr/SPARK-11217.	2015-11-06 14:51:03 -08:00
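The metadata described in the commit above (class name, uid, timestamp, paramMap, serialized as JSON) can be illustrated with a small round trip. This is a plain-Python sketch, not the Spark implementation: the JSON key names, the helper functions, the example class name and the uid are all assumptions chosen for illustration.

```python
import json
import time

# Sketch of JSON metadata for a saved estimator or transformer; not Spark code.
# Assumption: the key names below are illustrative, the commit lists only the
# kinds of fields that are stored.


def build_metadata(class_name, uid, param_map):
    """Collect the four pieces of metadata named in the commit message."""
    return {
        "class": class_name,
        "uid": uid,
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "paramMap": param_map,
    }


def load_metadata(text, expected_class):
    """Parse saved metadata and refuse to load it into the wrong class."""
    metadata = json.loads(text)
    if metadata["class"] != expected_class:
        raise ValueError(
            f"expected {expected_class}, found {metadata['class']}"
        )
    return metadata


# Hypothetical instance: the class name and uid are made up for the example.
saved = json.dumps(
    build_metadata("example.feature.Binarizer", "binarizer_0001", {"threshold": 0.5})
)
restored = load_metadata(saved, "example.feature.Binarizer")
print(restored["paramMap"])
```

Because only one paramMap is stored, a loader built this way cannot tell default values from user-set ones, which is the trade-off the commit message points out.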
Imran Rashid	49f1a82037	[SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.	2015-11-06 20:06:24 +00:00
Yu ISHIKAWA	8fa8c8375d	[SPARK-11514][ML] Pass random seed to spark.ml DecisionTree* cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9486 from yu-iskw/SPARK-11514.	2015-11-05 17:59:01 -08:00
Ehsan M.Kermani	f80f7b69a3	[SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression Here is my first commit. Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #8728 from ehsanmok/SinceAnn.	2015-11-05 12:11:57 -08:00
Yanbo Liang	9da7ceed81	[SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver Follow up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9485 from yanboliang/spark-11473.	2015-11-05 09:56:18 -08:00
a1singh	a94671a027	[SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation In file LDAOptimizer.scala: line 441: since "idx" was never used, replaced unrequired zipWithIndex.foreach with foreach. - nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) => + nonEmptyDocs.foreach { case (_, termCounts: Vector) => Author: a1singh <a1singh@ucsd.edu> Closes #9456 from a1singh/master.	2015-11-05 12:51:10 +00:00
Yu ISHIKAWA	411ff6afb4	[SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9469 from yu-iskw/SPARK-10028.	2015-11-04 15:28:19 -08:00
Yanbo Liang	e328b69c31	[SPARK-9492][ML][R] LogisticRegression in R should provide model statistics Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9303 from yanboliang/spark-9492.	2015-11-04 08:28:33 -08:00
Yanbo Liang	f54ff19b1e	[SPARK-11349][ML] Support transform string label for RFormula Currently ```RFormula``` can only handle label with ```NumericType``` or ```BinaryType``` (cast it to ```DoubleType``` as the label of Linear Regression training), we should also support label of ```StringType``` which is needed for Logistic Regression (glm with family = "binomial"). For label of ```StringType```, we should use ```StringIndexer``` to transform it to 0-based index. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9302 from yanboliang/spark-11349.	2015-11-03 08:32:37 -08:00
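The string-label handling in the commit above relies on mapping each distinct label to a 0-based index. Below is a plain-Python sketch of that mapping, not the `StringIndexer` code itself; `fit_label_index` is a hypothetical helper. It orders labels by descending frequency, as the Spark documentation describes for `StringIndexer`, and breaks ties alphabetically, which is an assumption made here only to keep the example deterministic.

```python
from collections import Counter

# Sketch of a 0-based string-label index; not Spark's StringIndexer.


def fit_label_index(labels):
    """Map each distinct label to a 0-based index, most frequent label first.

    Assumption: ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


labels = ["yes", "no", "yes", "yes", "no", "maybe"]
index = fit_label_index(labels)
encoded = [index[label] for label in labels]
print(index)
print(encoded)
```

The indices are emitted as doubles because the commit describes the label being used as a numeric column for training.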
Yanbo Liang	3434572b14	[MINOR][ML] Fix naming conventions of AFTSurvivalRegression coefficients Rename ```regressionCoefficients``` back to ```coefficients```, and name ```weights``` to ```parameters```. See discussion [here](https://github.com/apache/spark/pull/9311/files#diff-e277fd0bc21f825d3196b4551c01fe5fR230). mengxr vectorijk dbtsai Author: Yanbo Liang <ybliang8@gmail.com> Closes #9431 from yanboliang/aft-coefficients.	2015-11-03 08:31:16 -08:00
Yanbo Liang	d6f10aa7ea	[SPARK-9836][ML] Provide R-like summary statistics for OLS via normal equation solver https://issues.apache.org/jira/browse/SPARK-9836 Author: Yanbo Liang <ybliang8@gmail.com> Closes #9413 from yanboliang/spark-9836.	2015-11-03 08:29:07 -08:00
DB Tsai	21ad846238	[MINOR][ML] removed the old `getModelWeights` function Removed the old `getModelWeights` function which was private and renamed into `getModelCoefficients` Author: DB Tsai <dbt@netflix.com> Closes #9426 from dbtsai/feature-minor.	2015-11-02 19:07:31 -08:00
vectorijk	c020f7d9d4	[SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models Deprecated in `LogisticRegression` and `LinearRegression` Author: vectorijk <jiangkai@gmail.com> Closes #9311 from vectorijk/spark-10592.	2015-11-02 16:12:04 -08:00
Dominik Dahlem	ec03866a7e	[SPARK-11343][ML] Allow float and double prediction/label columns in RegressionEvaluator mengxr, felixcheung This pull request just relaxes the type of the prediction/label columns to be float and double. Internally, these columns are cast to double. The other evaluators might need to be changed also. Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.	2015-11-02 16:11:42 -08:00
Xiangrui Meng	33ae7a35da	[SPARK-11358][MLLIB] deprecate runs in k-means This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation. cc: srowen Author: Xiangrui Meng <meng@databricks.com> Closes #9322 from mengxr/SPARK-11358.	2015-11-02 13:42:16 -08:00
Yu ISHIKAWA	e963070c13	[SPARK-9722] [ML] Pass random seed to spark.ml DecisionTree* Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9402 from yu-iskw/SPARK-9722.	2015-11-01 23:52:50 -08:00
Nakul Jindal	69b9e4b3c2	[SPARK-11385] [ML] foreachActive made public in MLLib's vector API Made foreachActive public in MLLib's vector API Author: Nakul Jindal <njindal@us.ibm.com> Closes #9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.	2015-10-30 17:12:24 -07:00
Lewuathe	86d65265fc	[SPARK-11207] [ML] Add test cases for solver selection of LinearRegression as followup. This is the follow up work of SPARK-10668. * Fix minor style issues. * Add test case for checking whether solver is selected properly. Author: Lewuathe <lewuathe@me.com> Author: lewuathe <lewuathe@me.com> Closes #9180 from Lewuathe/SPARK-11207.	2015-10-30 02:59:05 -07:00
Yanbo Liang	fba9e95452	[SPARK-11369][ML][R] SparkR glm should support setting standardize SparkR glm currently supports: ```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0``` We should also support setting standardize, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit) Author: Yanbo Liang <ybliang8@gmail.com> Closes #9331 from yanboliang/spark-11369.	2015-10-28 08:50:21 -07:00
Nakul Jindal	5f1cee6f15	[SPARK-11332] [ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.	2015-10-28 01:02:03 -07:00
Xiangrui Meng	82c1c57728	[MINOR][ML] fix compile warns This fixes some compile time warnings. Author: Xiangrui Meng <meng@databricks.com> Closes #9319 from mengxr/mllib-compile-warn-20151027.	2015-10-27 23:41:42 -07:00
Sean Owen	826e1e304b	[SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2.	2015-10-27 23:07:37 -07:00
Reza Zadeh	8b292b19c9	[SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix. With a test. Author: Reza Zadeh <reza@databricks.com> Closes #8792 from rezazadeh/colsims.	2015-10-26 22:00:24 -07:00
Sean Owen	3cac6614a4	[SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier Author: Sean Owen <sowen@cloudera.com> Closes #9169 from srowen/SPARK-11184.	2015-10-26 21:47:42 -07:00
Jayant Shekar	4e38defae1	[SPARK-6723] [MLLIB] Model import/export for ChiSqSelector This is a PR for Parquet-based model import/export. * Added save/load for ChiSqSelectorModel * Updated the test suite ChiSqSelectorSuite Author: Jayant Shekar <jayant@user-MBPMBA-3.local> Closes #6785 from jayantshekhar/SPARK-6723.	2015-10-23 08:45:13 -07:00
Reynold Xin	cdea0174e3	[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package Author: Reynold Xin <rxin@databricks.com> Closes #9239 from rxin/types-private.	2015-10-23 00:00:21 -07:00
Xiangrui Meng	45861693be	[SPARK-10082][MLLIB] minor style updates for matrix indexing after #8271 * `>=0` => `>= 0` * print `i`, `j` in the log message MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #9189 from mengxr/SPARK-10082.	2015-10-20 18:37:29 -07:00
MechCoder	da46b77afd	[SPARK-10082][MLLIB] Validate i, j in apply DenseMatrices and SparseMatrices Given row_ind should be less than the number of rows Given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior for such cases. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8271 from MechCoder/hash_code_matrices.	2015-10-20 16:35:34 -07:00
Tijo Thomas	9f49895fef	[SPARK-10261][DOCUMENTATION, ML] Fixed @Since annotation to ml.evaluation Author: Tijo Thomas <tijoparacka@gmail.com> Author: tijo <tijo@ezzoft.com> Closes #8554 from tijoparacka/SPARK-10261-2.	2015-10-20 16:13:34 -07:00
lewuathe	4c33a34ba3	[SPARK-10668] [ML] Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small Author: lewuathe <lewuathe@me.com> Author: Lewuathe <sasaki@treasure-data.com> Author: Kai Sasaki <sasaki@treasure-data.com> Author: Lewuathe <lewuathe@me.com> Closes #8884 from Lewuathe/SPARK-10668.	2015-10-19 10:46:10 -07:00
Luvsandondov Lkhamsuren	cca2258685	[SPARK-9963] [ML] RandomForest cleanup: replace predictNodeIndex with predictImpl predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com> Closes #8609 from lkhamsurenl/SPARK-9963.	2015-10-17 10:07:42 -07:00
Yuhao Yang	e1e77b22b3	[SPARK-11029] [ML] Add computeCost to KMeansModel in spark.ml jira: https://issues.apache.org/jira/browse/SPARK-11029 We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel. This will be a temp fix until we have proper evaluators defined for clustering. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #9073 from hhbyyh/computeCost.	2015-10-17 10:04:19 -07:00
Burak Yavuz	10046ea76c	[SPARK-10599] [MLLIB] Lower communication for block matrix multiplication This PR aims to decrease communication costs in BlockMatrix multiplication in two ways: - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled - Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition NOTE: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was. Initial benchmarking showed promising results (look below), however I did hit some `FileNotFound` exceptions with the new implementation after the shuffle. Size A: 1e5 x 1e5 Size B: 1e5 x 1e5 Block Sizes: 1024 x 1024 Sparsity: 0.01 Old implementation: 1m 13s New implementation: 9s cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster Author: Burak Yavuz <brkyvz@gmail.com> Closes #8757 from brkyvz/opt-bmm.	2015-10-16 15:30:07 -07:00
vectorijk	3889b1c7a9	[SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression Value of the quantile probabilities array should be in the range (0, 1) instead of [0,1] in `AFTSurvivalRegression.scala` according to [Discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242) Author: vectorijk <jiangkai@gmail.com> Closes #9083 from vectorijk/spark-11059.	2015-10-13 15:57:36 -07:00
Xiangrui Meng	2b574f52d7	[SPARK-7402] [ML] JSON SerDe for standard param types This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementation of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9090 from mengxr/SPARK-7402.	2015-10-13 13:24:10 -07:00
Vladimir Vladimirov	c1b4ce4326	[SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com> Closes #8700 from smartkiwi/SPARK-10535_.	2015-10-09 14:16:13 -07:00
Nick Pritchard	5994cfe812	[SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric Compute upper triangular values of the covariance matrix, then copy to lower triangular values. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8940 from pnpritchard/SPARK-10875.	2015-10-08 22:22:20 -07:00
Yanbo Liang	2268356002	[SPARK-7770] [ML] GBT validationTol change to compare with relative or absolute error GBT compares the validation error against a tolerance that switches between relative and absolute, where the relative one is relative to the current loss on the training set. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8549 from yanboliang/spark-7770.	2015-10-08 11:27:46 -07:00
Holden Karau	0903c6489e	[SPARK-9718] [ML] linear regression training summary all columns LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful. Author: Holden Karau <holden@pigscanfly.ca> Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.	2015-10-08 11:16:20 -07:00
Nathan Howell	1bc435ae3a	[SPARK-10064] [ML] Parallelize decision tree bin split calculations Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours. Author: Nathan Howell <nhowell@godaddy.com> Closes #8246 from NathanHowell/SPARK-10064.	2015-10-07 17:46:16 -07:00
DB Tsai	dd36ec6bc5	[SPARK-10738] [ML] Refactoring `Instance` out from LOR and LIR, and also cleaning up some code Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code. Author: DB Tsai <dbt@netflix.com> Closes #8853 from dbtsai/refactoring.	2015-10-07 15:56:57 -07:00
Yanbo Liang	7bf07faa71	[SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS Consolidate the Cholesky solvers in WeightedLeastSquares and ALS. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8936 from yanboliang/spark-10490.	2015-10-07 15:50:45 -07:00
Evan Chen	da936fbb74	[SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) Provide initialModel param for pyspark.mllib.clustering.KMeans Author: Evan Chen <chene@us.ibm.com> Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.	2015-10-07 15:04:53 -07:00
Marcelo Vanzin	94fc57afdf	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8775 from vanzin/SPARK-10300.	2015-10-07 14:11:21 -07:00
Holden Karau	5be5d24744	[SPARK-9841] [ML] Make clear public It is currently impossible to clear Param values once set. It would be helpful to be able to. Author: Holden Karau <holden@pigscanfly.ca> Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.	2015-10-07 12:00:56 -07:00
Yin Huai	b0baa11d3b	[HOT-FIX] Fix style. https://github.com/apache/spark/pull/8882 broke our build. Author: Yin Huai <yhuai@databricks.com> Closes #8964 from yhuai/fixStyle.	2015-10-02 11:23:08 -07:00
Xusen Yin	633aaae0a1	[SPARK-6530] [ML] Add chi-square selector for ml package See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530). Author: Xusen Yin <yinxusen@gmail.com> Closes #5742 from yinxusen/SPARK-6530.	2015-10-02 10:25:58 -07:00
Xusen Yin	23a9448c04	[SPARK-5890] [ML] Add feature discretizer JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890). I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly. Author: Xusen Yin <yinxusen@gmail.com> Closes #5779 from yinxusen/SPARK-5890.	2015-10-02 10:19:18 -07:00
Rerngvit Yanggratoke	2a717821bb	[SPARK-9798] [ML] CrossValidatorModel Documentation Improvements Document CrossValidatorModel members: bestModel and avgMetrics Author: Rerngvit Yanggratoke <rerngvit@kth.se> Closes #8882 from rerngvit/Spark-9798.	2015-10-02 10:15:02 -07:00
Yanbo Liang	2931e89f0c	[SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = "" For some implicit datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8937 from yanboliang/spark-10736.	2015-09-29 23:58:32 -07:00
y-shimizu	299b439920	[SPARK-10778] [MLLIB] Implement toString for AssociationRules.Rule I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0` Author: y-shimizu <y.shimizu0429@gmail.com> Closes #8904 from y-shimizu/master.	2015-09-27 16:36:03 +01:00
Eric Liang	922338812c	[SPARK-9681] [ML] Support R feature interactions in RFormula This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.	2015-09-25 00:43:22 -07:00
Holden Karau	d91967e159	[SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use simplified dataframe construction As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those. Author: Holden Karau <holden@pigscanfly.ca> Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.	2015-09-23 22:49:08 -07:00
Yanbo Liang	067afb4e9b	[SPARK-10699] [ML] Support checkpointInterval can be disabled Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but we also need a way for users to disable it. This PR lets users disable checkpointing by setting ```checkpointInterval = -1```. We also add documentation for GBT ```cacheNodeIds``` so that users can understand checkpointing more clearly. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8820 from yanboliang/spark-10699.	2015-09-23 16:41:42 -07:00
Yanbo Liang	ce2b056d35	[SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8836 from yanboliang/spark-10686.	2015-09-23 15:26:02 -07:00
sethah	098be27ad5	[SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility. Author: sethah <seth.hendrickson16@gmail.com> Closes #8675 from sethah/SPARK-9715.	2015-09-23 15:00:52 -07:00
Yanbo Liang	7104ee0e5d	[SPARK-10750] [ML] ML Param validate should print better error information Currently, when you set an illegal value for params of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws IllegalArgumentException but with incomprehensible error information. Take ```VectorSlicer.setNames``` as an example: ```scala val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") // The value of setNames must contain distinct elements, so the next line will throw an exception. vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) ``` It will throw IllegalArgumentException as: ``` vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5. java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5. ``` We should distinguish the value of array type from primitive type at Param.validate(value: T), and we will get better error information. ``` vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1]. java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1]. ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #8863 from yanboliang/spark-10750.	2015-09-22 11:00:33 -07:00
Holden Karau	f4a3c4e34c	[SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training. Author: Holden Karau <holden@pigscanfly.ca> Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.	2015-09-22 10:19:08 -07:00
Meihua Wu	870b8a2edd	[SPARK-10706] [MLLIB] Add java wrapper for random vector rdd Add java wrapper for random vector rdd holdenk srowen Author: Meihua Wu <meihuawu@umich.edu> Closes #8841 from rotationsymmetry/SPARK-10706.	2015-09-22 11:05:24 +01:00
Feynman Liang	aeef44a3e3	[SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance testing Implementation of significance testing using Streaming API. Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Closes #4716 from feynmanliang/ab_testing.	2015-09-21 13:11:28 -07:00
Meihua Wu	331f0b10f7	[SPARK-9642] [ML] LinearRegression should support weighted data In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting, which accounts for over- or under-sampling. Work in progress. Author: Meihua Wu <meihuawu@umich.edu> Closes #8631 from rotationsymmetry/SPARK-9642.	2015-09-21 12:09:00 -07:00
Holden Karau	20a61dbd9b	[SPARK-10626] [MLLIB] create java friendly method for random rdd SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method. Author: Holden Karau <holden@pigscanfly.ca> Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.	2015-09-21 18:53:28 +01:00
lewuathe	0c498717ba	[SPARK-10715] [ML] Duplicate initialization flag in WeightedLeastSquare There is a duplicate setting of the initialization flag in `WeightedLeastSquares#add`. `initialized` is already set in `init(Int)`. Author: lewuathe <lewuathe@me.com> Closes #8837 from Lewuathe/duplicate-initialization-flag.	2015-09-20 16:16:31 -07:00
Sean Owen	1aa9e50256	[SPARK-5905] [MLLIB] Note requirements for certain RowMatrix methods in docs Note methods that fail for cols > 65535; note that SVD does not require n >= m CC mengxr Author: Sean Owen <sowen@cloudera.com> Closes #8839 from srowen/SPARK-5905.	2015-09-20 16:05:12 -07:00
Eric Liang	c8149ef2c5	[MINOR] [ML] override toString of AttributeGroup This makes equality test failures much more readable. mengxr Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #8826 from ericl/attrgroupstr.	2015-09-18 16:23:05 -07:00
Yanbo Liang	98f1ea67da	[SPARK-8518] [ML] Log-linear models for survival analysis [Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easy-to-parallelize method of survival analysis for censored survival data. It is the log-linear model based on the Weibull distribution of the survival time. Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the prediction. There are different kinds of model prediction; I have selected only the type ```response```, which is the default in R. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8611 from yanboliang/spark-8518.	2015-09-17 21:37:10 -07:00
Eric Liang	4fbf332869	[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions This is a pre-req for supporting the ":" operator in the RFormula feature transformer. Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit mengxr Author: Eric Liang <ekl@databricks.com> Closes #7987 from ericl/interaction.	2015-09-17 14:09:06 -07:00
Yanbo Liang	64743870f2	[SPARK-10394] [ML] Make GBTParams use shared stepSize ```GBTParams``` has ```stepSize``` as learning rate currently. ML has shared param class ```HasStepSize```, ```GBTParams``` can extend from it rather than duplicated implementation. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8552 from yanboliang/spark-10394.	2015-09-17 11:24:38 -07:00
Holden Karau	e51345e1e0	[SPARK-10077] [DOCS] [ML] Add package info for java of ml/feature Should be the same as SPARK-7808 but use Java for the code example. It would be great to add package doc for `spark.ml.feature`. Author: Holden Karau <holden@pigscanfly.ca> Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.	2015-09-17 09:17:43 -07:00
DB Tsai	be52faa7c7	[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression In fraud detection datasets, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data will bias the models toward the negative class, resulting in poor performance. In python-scikit, they provide a correction allowing users to over-/undersample the samples of each class according to the given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into the loss and gradient instead of doing actual over-/undersampling in the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, like the training samples from tenure users, while the training samples from new users may be less important. We should be able to provide another "weight: Double" information in the LabeledPoint to weight them differently in the learning algorithm. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com> Closes #7884 from dbtsai/SPARK-7685.	2015-09-15 15:46:47 -07:00
Marcelo Vanzin	b42059d2ef	Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py." This reverts commit `8abef21dac`.	2015-09-15 13:03:38 -07:00
Marcelo Vanzin	8abef21dac	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. This change does two things: - tag a few tests and adds the mechanism in the build to be able to disable those tags, both in maven and sbt, for both junit and scalatest suites. - add some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8437 from vanzin/test-tags.	2015-09-15 10:45:02 -07:00
Yuhao Yang	c35fdcb7e9	[SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS jira: https://issues.apache.org/jira/browse/SPARK-10491 We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`. Let me know if new UT needed. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8663 from hhbyyh/movedspr.	2015-09-15 09:58:49 -07:00
Reynold Xin	09b7e7c198	Update version to 1.6.0-SNAPSHOT. Author: Reynold Xin <rxin@databricks.com> Closes #8350 from rxin/1.6.	2015-09-15 00:54:20 -07:00
Nick Pritchard	8a634e9bcc	[SPARK-10573] [ML] IndexToString output schema should be StringType Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8751 from pnpritchard/SPARK-10573.	2015-09-14 13:27:45 -07:00
Yanbo Liang	ce6f3f163b	[SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8457 from yanboliang/spark-10194.	2015-09-14 12:08:52 -07:00
Bertrand Dechoux	d81565465c	[SPARK-9720] [ML] Identifiable types need UID in toString methods A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was not present anymore in the toString result. It is the default behaviour. This patch is a quick fix. The question of enforcement is still up. No tests have been written to verify the toString method behaviour. That would be long to do because all types should be tested and not only those which have a regression now. It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira). Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com> Closes #8062 from BertrandDechoux/SPARK-9720.	2015-09-14 09:18:46 +01:00
Joseph K. Bradley	2e3a280754	[MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils Changes: * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited. * MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore CC: holdenk mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8679 from jkbradley/doc-fixes-1.5.	2015-09-11 08:55:35 -07:00
Xiangrui Meng	960d2d0ac6	[SPARK-10537] [ML] document LIBSVM source options in public API doc and some minor improvements We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR: 1. Do `vectorType == "sparse"` only once. 2. Update `hashCode` and `equals`. 3. Remove inherited doc. 4. Delete temp dir in `afterAll`. Lewuathe Author: Xiangrui Meng <meng@databricks.com> Closes #8699 from mengxr/SPARK-10537.	2015-09-11 08:53:40 -07:00
Yanbo Liang	b01b262606	[SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier Add Python API for ```MultilayerPerceptronClassifier```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8067 from yanboliang/SPARK-9773.	2015-09-11 08:52:28 -07:00
Yanbo Liang	339a527141	[SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval between Scala and Python API. "checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them. ``` member of DecisionTreeParams <-> Scala API shared param for all ML Transformer/Estimator <-> Python API ``` Proposal: "checkpointInterval" is also used by ALS, so we make it a shared param in Scala. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8528 from yanboliang/spark-10023.	2015-09-10 20:34:00 -07:00
lewuathe	2ddeb63126	[SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM data It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API. Two options are implemented. * `numFeatures`: Specify the dimension of features vector * `featuresType`: Specify the type of output vector. `sparse` is default. Author: lewuathe <lewuathe@me.com> Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits: 986999d [lewuathe] Change unit test phrase 11d513f [lewuathe] Fix some reviews 21600a4 [lewuathe] Merge branch 'master' into SPARK-10117 9ce63c7 [lewuathe] Rewrite service loader file 1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117 ba3657c [lewuathe] Merge branch 'master' into SPARK-10117 0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF 4f40891 [lewuathe] Improve test suites 5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117 8660d0e [lewuathe] Fix Java unit test b56a948 [lewuathe] Merge branch 'master' into SPARK-10117 2c12894 [lewuathe] Remove unnecessary tag 7d693c2 [lewuathe] Resolv conflict 62010af [lewuathe] Merge branch 'master' into SPARK-10117 a97ee97 [lewuathe] Fix some points aef9564 [lewuathe] Fix 70ee4dd [lewuathe] Add Java test 3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data 40d3027 [lewuathe] Add Java test 7056d4a [lewuathe] Merge branch 'master' into SPARK-10117 99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data	2015-09-09 09:29:10 -07:00
Luc Bourlier	c1bc4f439f	[SPARK-10227] fatal warnings with sbt on Scala 2.11 The bulk of the changes are on the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary. But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations. The remainder are some potential bugs and deprecated syntax. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #8433 from skyluc/issue/sbt-2.11.	2015-09-09 09:57:58 +01:00
Holden Karau	2f6fd5256c	[SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark Adds IndexToString to PySpark. Author: Holden Karau <holden@pigscanfly.ca> Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.	2015-09-08 22:13:05 -07:00
Yanbo Liang	a1573489a3	[SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator Add WeibullGenerator for RandomDataGenerator. #8611 needs to use WeibullGenerator to generate random data based on the Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.	2015-09-08 20:54:02 -07:00
Xiangrui Meng	52fe32f6ac	[SPARK-9834] [MLLIB] implement weighted least squares via normal equation The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet. There are couple TODOs that can be addressed in future PRs: * consolidate summary statistics aggregators * move `dspr` to `BLAS` * etc It would be nice to have this merged first because it blocks couple other features. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8588 from mengxr/SPARK-9834.	2015-09-08 20:51:20 -07:00
Vinod K C	e6f8d36860	[SPARK-10468] [ MLLIB ] Verify schema before Dataframe select API call Loader.checkSchema was called to verify the schema after dataframe.select(...). Schema verification should be done before dataframe.select(...) Author: Vinod K C <vinod.kc@huawei.com> Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.	2015-09-08 14:44:05 -07:00
Yanbo Liang	f7b55dbfc3	[SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parent Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent. Here fix it and add test case. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8637 from yanboliang/spark-10470.	2015-09-08 12:48:21 -07:00
Yanbo Liang	5b2192e846	[SPARK-10480] [ML] Fix ML.LinearRegressionModel.copy() This PR fixes two model ```copy()``` related issues: [SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480) ```ML.LinearRegressionModel.copy()``` ignored the argument ```extra```, so it did not take effect when users set this parameter. [SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479) ```ML.LogisticRegressionModel.copy()``` should copy model summary if available. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8641 from yanboliang/linear-regression-copy.	2015-09-08 11:11:35 -07:00
Holden Karau	871764c6ce	[SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit tests From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests. Author: Holden Karau <holden@pigscanfly.ca> Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.	2015-09-05 00:04:00 -10:00
Holden Karau	22eab706f4	[SPARK-10402] [DOCS] [ML] Add defaults to the scaladoc for params in ml/ We should make sure the scaladoc for params includes their default values through the models in ml/ Author: Holden Karau <holden@pigscanfly.ca> Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.	2015-09-04 17:32:35 -07:00
Holden Karau	44948a2e9d	[SPARK-9723] [ML] params getordefault should throw more useful error Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup. Author: Holden Karau <holden@pigscanfly.ca> Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.	2015-09-02 21:19:42 -07:00
Holden Karau	e6e483cc4d	[SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.	2015-09-01 10:48:57 -07:00
Yanbo Liang	fe16fd0b8b	[SPARK-10349] [ML] OneVsRest use 'when ... otherwise' not UDF to generate new label at binary reduction Currently OneVsRest uses a UDF to generate the new binary label during training. Considering that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which will be more efficient. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8519 from yanboliang/spark-10349.	2015-08-31 16:06:38 -07:00
Xiangrui Meng	23e39cc7b1	[SPARK-9954] [MLLIB] use first 128 nonzeros to compute Vector.hashCode This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8182 from mengxr/SPARK-9954.	2015-08-31 15:49:25 -07:00
Xiangrui Meng	f0f563a3c4	[SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initialization * do not cache first cost RDD * change following cost RDD cache level to MEMORY_AND_DISK * remove Vector wrapper to save an object per instance Further improvements will be addressed in SPARK-10329 cc: yu-iskw HuJiayin Author: Xiangrui Meng <meng@databricks.com> Closes #8526 from mengxr/SPARK-10354.	2015-08-30 23:20:03 -07:00
Burak Yavuz	8d2ab75d3b	[SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications mengxr jkbradley rxin It would be great if this fix made it into RC3! Author: Burak Yavuz <brkyvz@gmail.com> Closes #8525 from brkyvz/blas-scaling.	2015-08-30 12:21:15 -07:00
Yu ISHIKAWA	4eeda8d472	[SPARK-10260] [ML] Add @Since annotation to ml.clustering ### JIRA [[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8455 from yu-iskw/SPARK-10260.	2015-08-28 00:50:26 -07:00
Feynman Liang	5bfe9e1111	[SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.	2015-08-27 16:10:37 -07:00
Vyacheslav Baranov	fdd466bed7	[SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data `GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache. The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, in line 270 when fitting `StandardScaler` for the first time, and when running the optimizer for the second time. So it might be worth restoring the removed warning. Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #8395 from SlavikBaranov/SPARK-10182.	2015-08-27 18:56:18 +01:00
Feynman Liang	e1f4de4a7d	[SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests * Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Replaces `commons.lang.StringUtils` over `com.google.collections.Strings` * Replaces `List` interface over `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.	2015-08-27 18:46:41 +01:00
Jacek Laskowski	b02e818722	[SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11 Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.	2015-08-27 11:07:37 +01:00
Feynman Liang	1a446f75b6	[SPARK-10256] [ML] Removes guava dependency from spark.ml.classification JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.	2015-08-27 10:46:18 +01:00
Feynman Liang	75d6230794	[SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.	2015-08-27 10:45:35 +01:00
Feynman Liang	1650f6f56e	[SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests * Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.	2015-08-27 10:44:44 +01:00
Xiangrui Meng	086d4681df	[SPARK-10241] [MLLIB] update since versions in mllib.recommendation Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #8432 from mengxr/SPARK-10241.	2015-08-26 14:02:19 -07:00
Xiangrui Meng	6519fd06cc	[SPARK-9665] [MLLIB] audit MLlib API annotations I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8452 from mengxr/SPARK-9665.	2015-08-26 11:47:05 -07:00
Xiangrui Meng	321d775969	[SPARK-10236] [MLLIB] update since versions in mllib.feature Same as #8421 but for `mllib.feature`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature	2015-08-25 23:45:41 -07:00
Xiangrui Meng	4657fa1f37	[SPARK-10235] [MLLIB] update since versions in mllib.regression Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression	2015-08-25 22:49:33 -07:00
Xiangrui Meng	fb7e12fe2e	[SPARK-10243] [MLLIB] update since versions in mllib.tree Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8442 from mengxr/SPARK-10236.	2015-08-25 22:35:49 -07:00
Xiangrui Meng	d703372f86	[SPARK-10234] [MLLIB] update since version in mllib.clustering Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8435 from mengxr/SPARK-10234.	2015-08-25 22:33:48 -07:00
Xiangrui Meng	c3a54843c0	[SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat The same as #8241 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242.	2015-08-25 22:31:23 -07:00
Xiangrui Meng	ab431f8a97	[SPARK-10238] [MLLIB] update since versions in mllib.linalg Same as #8421 but for `mllib.linalg`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8440 from mengxr/SPARK-10238 and squashes the following commits: b38437e [Xiangrui Meng] update since versions in mllib.linalg	2015-08-25 20:07:56 -07:00
Xiangrui Meng	8668ead2e7	[SPARK-10233] [MLLIB] update since version in mllib.evaluation Same as #8421 but for `mllib.evaluation`. cc avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8423 from mengxr/SPARK-10233.	2015-08-25 18:17:54 -07:00
Feynman Liang	125205cdb3	[SPARK-9888] [MLLIB] User guide for new LDA features * Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.	2015-08-25 17:39:20 -07:00
Xiangrui Meng	00ae4be97f	[SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util	2015-08-25 14:11:38 -07:00
Feynman Liang	9205907876	[SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797.	2015-08-25 13:23:15 -07:00
Xiangrui Meng	c619c7552f	[SPARK-10237] [MLLIB] update since versions in mllib.fpm Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237.	2015-08-25 13:22:38 -07:00
Feynman Liang	c0e9ff1588	[SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias * Adds doc for alias of runMiniBatchSGD documenting default value for convergeTol * Cleans up a note in code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800.	2015-08-25 13:21:05 -07:00
Xiangrui Meng	16a2be1a84	[SPARK-10231] [MLLIB] update @Since annotation for mllib.classification Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification	2015-08-25 12:16:23 -07:00
Feynman Liang	881208a8e8	[SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.	2015-08-25 11:58:47 -07:00
Sean Owen	69c9c17716	[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.	2015-08-25 12:33:13 +01:00
Joseph K. Bradley	b963c19a80	[SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix.	2015-08-23 18:34:07 -07:00
Xusen Yin	630a994e6a	[SPARK-9893] User guide with Java test suite for VectorSlicer Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer. Note that Python version does not support selecting by names now. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893.	2015-08-21 16:30:12 -07:00
Joseph K. Bradley	f01c4220d2	[SPARK-10163] [ML] Allow single-category features for GBT models Removed categorical feature info validation since no longer needed This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8367 from jkbradley/gbt-single-cat.	2015-08-21 16:28:00 -07:00
MechCoder	f5b028ed2f	[SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8352 from MechCoder/since.	2015-08-21 14:19:24 -07:00
Joseph K. Bradley	eaafe139f8	[SPARK-9245] [MLLIB] LDA topic assignments For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments.	2015-08-20 15:01:31 -07:00
MechCoder	7cfc0750e1	[SPARK-10108] Add since tags to mllib.feature Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8309 from MechCoder/tags_feature.	2015-08-20 14:56:08 -07:00
Xiangrui Meng	2a3d98aae2	[SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138.	2015-08-20 14:47:04 -07:00
Eric Liang	8e0a072f78	[SPARK-9895] User Guide for RFormula Feature Transformer mengxr Author: Eric Liang <ekl@databricks.com> Closes #8293 from ericl/docs-2.	2015-08-19 15:43:08 -07:00
Xiangrui Meng	5b62bef8cb	[SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see `72fdeb6463`). MechCoder Closes #8256 Author: Xiangrui Meng <meng@databricks.com> Author: Xiaoqing Wang <spark445@126.com> Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8288 from mengxr/SPARK-8918.	2015-08-19 13:17:26 -07:00
Feynman Liang	28a98464ea	[SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds a `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097.	2015-08-19 11:35:05 -07:00
lewuathe	c635a16f64	[SPARK-10012] [ML] Missing test case for Params#arrayLengthGt Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.	2015-08-18 15:30:23 -07:00
Bryan Cutler	1dbffba37a	[SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.	2015-08-18 14:58:30 -07:00
Feynman Liang	f5ea391290	[SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.	2015-08-18 12:53:57 -07:00
Yuhao Yang	354f4582b6	[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.	2015-08-18 11:00:09 -07:00
Yanbo Liang	dd0614fd61	[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public.	2015-08-17 23:57:02 -07:00
Xiangrui Meng	e290029a35	[SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808.	2015-08-17 19:40:51 -07:00
Prayag Chandran	18523c1305	SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression Added since tags to mllib.regression Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags 1a0365f [Prayag Chandran] Reformating and adding a few more tags 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression	2015-08-17 17:26:08 -07:00
Sameer Abhyankar	088b11ec59	[SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.	2015-08-17 16:00:23 -07:00
Feynman Liang	f7efda3975	[SPARK-9959] [MLLIB] Association Rules Java Compatibility mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.	2015-08-17 09:58:34 -07:00
Davies Liu	37586e5449	[HOTFIX] fix duplicated braces Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.	2015-08-14 20:56:55 -07:00
Joseph K. Bradley	2a6590e510	[SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.	2015-08-14 14:05:03 -07:00
Wenchen Fan	34d610be85	[SPARK-9929] [SQL] support metadata in withColumn In MLlib, sometimes we need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and in `withColumn` we alias this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8159 from cloud-fan/withColumn.	2015-08-14 12:00:01 -07:00
Holden Karau	a7317ccdc2	[SPARK-8744] [ML] Add a public constructor to StringIndexer It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.	2015-08-14 11:22:10 -07:00
Joseph K. Bradley	7ecf0c4699	[SPARK-9956] [ML] Make trees work with one-category features This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.	2015-08-14 10:48:02 -07:00
Xiangrui Meng	a0e1abbd01	[SPARK-9661] [MLLIB] minor clean-up of SPARK-9661 Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8190 from mengxr/SPARK-9661-fix.	2015-08-14 10:25:11 -07:00
Xiangrui Meng	6c5858bc65	[SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng <meng@databricks.com> Closes #8152 from mengxr/SPARK-9922.	2015-08-13 16:52:17 -07:00
MechCoder	864de8eaf4	[SPARK-9661] [MLLIB] [ML] Java compatibility I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same. 1. Some methods in LDAModel. 2. runMiniBatchSGD 3. kolmogorovSmirnovTest Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8126 from MechCoder/java_incop.	2015-08-13 13:42:35 -07:00

1306 commits