ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Xusen Yin	d6dc12ef01	[SPARK-13449] Naive Bayes wrapper in SparkR ## What changes were proposed in this pull request? This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli. I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes. I removed the preprocessing part that omits NA values because we don't know which columns to process. ## How was this patch tested? Test against output from R package e1071's naiveBayes. cc: yanboliang yinxusen Closes #11486 Author: Xusen Yin <yinxusen@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #11890 from mengxr/SPARK-13449.	2016-03-22 14:16:51 -07:00
Dongjoon Hyun	df61fbd978	[SPARK-13986][CORE][MLLIB] Remove `DeveloperApi`-annotations for non-publics ## What changes were proposed in this pull request? Spark uses the `DeveloperApi` annotation, but it sometimes conflicts with visibility. This PR tries to fix those conflicts by removing the annotation from non-public members. The following is an example from JobResult.scala: ```scala @DeveloperApi sealed trait JobResult @DeveloperApi case object JobSucceeded extends JobResult -@DeveloperApi private[spark] case class JobFailed(exception: Exception) extends JobResult ``` ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11797 from dongjoon-hyun/SPARK-13986.	2016-03-21 14:57:52 +00:00
Dongjoon Hyun	20fd254101	[SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has a 100-character limit on lines, but it has been disabled for Java since 11/09/15. This PR enables the LineLength checkstyle rule again. To help with that, it also introduces RedundantImport and RedundantModifier. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.\|^import.\|a href\|href\|http://\|https://\|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should pass locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.	2016-03-21 07:58:57 +00:00
sethah	811a524722	[SPARK-12182][ML] Distributed binning for trees in spark.ml This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib. Author: sethah <seth.hendrickson16@gmail.com> Closes #10231 from sethah/SPARK-12182.	2016-03-20 12:31:28 -07:00
Yuhao Yang	f43a26ef92	[SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013, containing some comment update and style adjustment. jkbradley ## How was this patch tested? unit tests. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11830 from hhbyyh/cvToggle.	2016-03-18 17:34:33 -07:00
Yanbo Liang	7783b6f38f	[MINOR][ML] When trainingSummary is None, it should throw RuntimeException. ## What changes were proposed in this pull request? When trainingSummary is None, it should throw ```RuntimeException```. cc mengxr ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11784 from yanboliang/fix-summary.	2016-03-18 11:23:17 +00:00
sethah	1614485fd9	[SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say there are 3 categories A, B, C. We consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A). This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features. Author: sethah <seth.hendrickson16@gmail.com> Closes #9474 from sethah/SPARK-10788.	2016-03-17 16:44:41 -07:00
Joseph K. Bradley	b39e80d39d	[SPARK-13761][ML] Remove remaining uses of validateParams ## What changes were proposed in this pull request? Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema ## How was this patch tested? Existing unit tests, modified to check using transformSchema instead of validateParams Author: Joseph K. Bradley <joseph@databricks.com> Closes #11790 from jkbradley/SPARK-13761-cleanup.	2016-03-17 13:23:07 -07:00
Xusen Yin	edf8b8775b	[SPARK-11891] Model export/import for RFormula and RFormulaModel https://issues.apache.org/jira/browse/SPARK-11891 Author: Xusen Yin <yinxusen@gmail.com> Closes #9884 from yinxusen/SPARK-11891.	2016-03-17 10:19:10 -07:00
Wenchen Fan	8ef3399aff	[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.	2016-03-17 19:23:38 +08:00
Yuhao Yang	357d82d84d	[SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html If set, then all non-zero counts will be set to 1. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11536 from hhbyyh/cvToggle.	2016-03-17 11:21:11 +02:00
Yuhao Yang	92b70576ea	[SPARK-13761][ML] Deprecate validateParams ## What changes were proposed in this pull request? Deprecate validateParams() method here: `035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)` Move all functionality in overridden methods to transformSchema(). Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11620 from hhbyyh/depreValid.	2016-03-16 17:31:55 -07:00
Jakob Odersky	d4d84936fb	[SPARK-11011][SQL] Narrow type of UDT serialization ## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type. ## How was this patch tested? Existing tests were successfully run on local machine. Author: Jakob Odersky <jakob@odersky.com> Closes #11379 from jodersky/SPARK-11011-udt-types.	2016-03-16 16:59:36 -07:00
Xiangrui Meng	85c42fda99	[SPARK-13927][MLLIB] add row/column iterator to local matrices ## What changes were proposed in this pull request? Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly. ## How was this patch tested? Unit tests on sparse and dense matrix. cc: dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #11757 from mengxr/SPARK-13927.	2016-03-16 14:19:54 -07:00
Joseph K. Bradley	6fc2b6541f	[SPARK-11888][ML] Decision tree persistence in spark.ml ### What changes were proposed in this pull request? Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel * The shared implementation is in treeModels.scala * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files. Other changes: * Made CategoricalSplit.numCategories public (to use in persistence) * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed. ### How was this patch tested? Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11581 from jkbradley/dt-io.	2016-03-16 14:18:35 -07:00
Yanbo Liang	3f06eb72ca	[SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format ## What changes were proposed in this pull request? Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package. cc mengxr ## How was this patch tested? The test suite is ignored, but I have enabled all these cases offline and it works as expected. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11463 from yanboliang/spark-13613.	2016-03-16 14:14:15 -07:00
Cheng Hao	d9670f8473	[SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13894 Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`. ## How was this patch tested? No additional unit test required. Author: Cheng Hao <hao.cheng@intel.com> Closes #11730 from chenghao-intel/range.	2016-03-16 11:20:15 -07:00
Sean Owen	3b461d9ecd	[SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up ## What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/11657 - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests) - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11725 from srowen/SPARK-13823.2.	2016-03-16 09:36:34 +00:00
Yanbo Liang	3665294d4e	[SPARK-9837][ML] R-like summary statistics for GLMs via iteratively reweighted least squares ## What changes were proposed in this pull request? Provide R-like summary statistics for GLMs via iteratively reweighted least squares. ## How was this patch tested? unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11694 from yanboliang/spark-9837.	2016-03-15 22:30:07 -07:00
sethah	dafd70fbfe	[SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation. Performance testing should be done to ensure there were no regressions. Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing) Author: sethah <seth.hendrickson16@gmail.com> Closes #10607 from sethah/SPARK-12379.	2016-03-15 11:50:34 +02:00
Michael Armbrust	17eec0a71b	[SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed. Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties: - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat` - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf) - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning. - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm. Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API. A stub for `FileScanRDD` is also added, but most methods remain unimplemented. Other minor cleanups: - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore) - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out. - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes. 
Author: Michael Armbrust <michael@databricks.com> Closes #11646 from marmbrus/fileStrategy.	2016-03-14 19:21:12 -07:00
Ehsan M.Kermani	992142b87e	[SPARK-11826][MLLIB] Refactor add() and subtract() methods srowen Could you please check this when you have time? Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #9916 from ehsanmok/JIRA-11826.	2016-03-14 19:17:09 -07:00
Dongjoon Hyun	a48296f4fe	[SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD ## What changes were proposed in this pull request? `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` as a constructor argument. They just depend on GradientDescent's default regParam value. To be consistent with other algorithms, we had better add it. The same default value is used. ## How was this patch tested? Pass the existing unit test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11527 from dongjoon-hyun/SPARK-13686.	2016-03-14 12:46:53 -07:00
Dongjoon Hyun	acdf219703	[MINOR][DOCS] Fix more typos in comments/strings. ## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.	2016-03-14 09:07:39 +00:00
Sean Owen	1840852841	[SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit `1deecd8d9c` ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.	2016-03-13 21:03:49 -07:00
Dongjoon Hyun	db88d0204e	[MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc. ## What changes were proposed in this pull request? SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc. * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.	2016-03-13 12:11:18 +08:00
Cheng Lian	c079420d7c	[SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows() ## What changes were proposed in this pull request? This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful. ## How was this patch tested? Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11678 from liancheng/remove-collect-rows-and-take-rows.	2016-03-13 12:02:52 +08:00
Cheng Lian	1d542785b9	[SPARK-13244][SQL] Migrates DataFrame to Dataset ## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but the Java API is broken as Java knows nothing about Scala type aliases (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signatures with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. 
Dataset always does eager analysis now. We used to support disabling DataFrame eager analysis to help report a partially analyzed malformed logical plan on analysis failure. However, Dataset encoders require eager analysis during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.	2016-03-10 17:00:17 -08:00
Dongjoon Hyun	91fed8e9c5	[SPARK-3854][BUILD] Scala style: require spaces before `{`. ## What changes were proposed in this pull request? Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time. ``` // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } ``` IntelliJ also shows new warnings based on this. ## How was this patch tested? Pass the Jenkins ScalaStyle test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11637 from dongjoon-hyun/SPARK-3854.	2016-03-10 15:57:22 -08:00
sethah	9fe38aba1f	[SPARK-11108][ML] OneHotEncoder should support other numeric types Adding support for other numeric types: * Integer * Short * Long * Float * Decimal Author: sethah <seth.hendrickson16@gmail.com> Closes #9777 from sethah/SPARK-11108.	2016-03-10 13:17:41 +02:00
sethah	e1772d3f19	[SPARK-11861][ML] Add feature importances for decision trees This patch adds an API entry point for single decision tree feature importances. Author: sethah <seth.hendrickson16@gmail.com> Closes #9912 from sethah/SPARK-11861.	2016-03-09 14:44:51 -08:00
Yanbo Liang	0dd06485c4	[SPARK-13615][ML] GeneralizedLinearRegression supports save/load ## What changes were proposed in this pull request? ```GeneralizedLinearRegression``` supports ```save/load```. cc mengxr ## How was this patch tested? unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11465 from yanboliang/spark-13615.	2016-03-09 11:59:22 -08:00
Dongjoon Hyun	c3689bc24e	[SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. ## What changes were proposed in this pull request? In order to make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator. ``` - final ArrayList<Product2<Object, Object>> dataToWrite = - new ArrayList<Product2<Object, Object>>(); + final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>(); ``` Java 7 or higher supports the diamond operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code uses it inconsistently. ## How was this patch tested? Manual. Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11541 from dongjoon-hyun/SPARK-13702.	2016-03-09 10:31:26 +00:00
Yanbo Liang	9740954f3f	[ML] testEstimatorAndModelReadWrite should call checkModelData ## What changes were proposed in this pull request? Although we defined ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` fails to call ```checkModelData``` to check the equality of model data. So the model data equality check currently does not run for any test case, and we should fix that. BTW, fix the bug in the LDA read/write test, which did not set ```docConcentration```. This bug should have failed the test, but it did not complain because ```checkModelData``` was never actually run. cc jkbradley mengxr ## How was this patch tested? No new unit test, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11513 from yanboliang/ml-check-model-data.	2016-03-08 13:27:31 -08:00
Sean Owen	54040f8d35	[SPARK-13715][MLLIB] Remove last usages of jblas in tests ## What changes were proposed in this pull request? Remove last usage of jblas, in tests ## How was this patch tested? Jenkins tests -- the same ones that are being modified. Author: Sean Owen <sowen@cloudera.com> Closes #11560 from srowen/SPARK-13715.	2016-03-08 17:47:55 +00:00
Michael Armbrust	e720dda42e	[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation `HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`. External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0. ### HadoopFsRelation A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This is an internal representation that no longer needs to be exposed to developers. ```scala case class HadoopFsRelation( sqlContext: SQLContext, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation ``` ### FileFormat The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files. 
```scala trait FileFormat { def inferSchema( sqlContext: SQLContext, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] def prepareWrite( sqlContext: SQLContext, job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory def buildInternalScan( sqlContext: SQLContext, dataSchema: StructType, requiredColumns: Array[String], filters: Array[Filter], bucketSet: Option[BitSet], inputFiles: Array[FileStatus], broadcastedConf: Broadcast[SerializableConfiguration], options: Map[String, String]): RDD[InternalRow] } ``` The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file. ### FileCatalog This interface is used to list the files that make up a given relation, as well as handle directory based partitioning. ```scala trait FileCatalog { def paths: Seq[Path] def partitionSpec(schema: Option[StructType]): PartitionSpec def allFiles(): Seq[FileStatus] def getStatus(path: Path): Array[FileStatus] def refresh(): Unit } ``` Currently there are two implementations: - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore. 
### ResolvedDataSource Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore): - `paths: Seq[String] = Nil` - `userSpecifiedSchema: Option[StructType] = None` - `partitionColumns: Array[String] = Array.empty` - `bucketSpec: Option[BucketSpec] = None` - `provider: String` - `options: Map[String, String]` This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here. ### DataSourceAnalysis / DataSourceStrategy Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including: - pruning the files from partitions that will be read based on filters. - appending partition columns* - applying additional filters when a data source can not evaluate them internally. - constructing an RDD that is bucketed correctly when required* - sanity checking schema match-up and other analysis when writing. *In the future we should do the following: - Break out file handling into its own Strategy as it's sufficiently complex / isolated. - Push the appending of partition columns down into `FileFormat` to avoid an extra copy / unvectorization. - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2` Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11509 from marmbrus/fileDataSource.	2016-03-07 15:15:10 -08:00
Xusen Yin	83302c3bff	[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`. In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`. Author: Xusen Yin <yinxusen@gmail.com> Closes #11203 from yinxusen/SPARK-13036.	2016-03-04 08:32:24 -08:00
Abou Haydar Elias	27e88faa05	[SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get… ## What changes were proposed in this pull request? It avoids counting the dataframe twice. Author: Abou Haydar Elias <abouhaydar.elias@gmail.com> Author: Elie A <abouhaydar.elias@gmail.com> Closes #11491 from eliasah/quantile-discretizer-patch.	2016-03-04 10:01:52 +00:00
Dongjoon Hyun	941b270b70	[MINOR] Fix typos in comments and testcase name of code ## What changes were proposed in this pull request? This PR fixes typos in comments and testcase name of code. ## How was this patch tested? manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.	2016-03-03 22:42:12 +00:00
Yanbo Liang	ce58e99aae	[MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam ## What changes were proposed in this pull request? Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367) cc mengxr srowen ## How was this patch tested? Documents change, no test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11344 from yanboliang/shared-cleanup.	2016-03-03 13:36:54 -08:00
Dongjoon Hyun	b5f02d6743	[SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule ## What changes were proposed in this pull request? After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time. This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers. ## How was this patch tested? ``` ./dev/lint-java ./build/sbt compile ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11438 from dongjoon-hyun/SPARK-13583.	2016-03-03 10:12:32 +00:00
Sean Owen	e97fc7f176	[SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x ## What changes were proposed in this pull request? Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly: - Inner class should be static - Mismatched hashCode/equals - Overflow in compareTo - Unchecked warnings - Misuse of assert, vs junit.assert - get(a) + getOrElse(b) -> getOrElse(a,b) - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions - Dead code - tailrec - exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count - reduce(_+_) -> sum map + flatten -> map The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places. ## How was the this patch tested? Existing Jenkins unit tests. Author: Sean Owen <sowen@cloudera.com> Closes #11292 from srowen/SPARK-13423.	2016-03-03 09:54:09 +00:00
Yanbo Liang	5ed48dd84d	[SPARK-12811][ML] Estimator for Generalized Linear Models(GLMs) Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #11136 from yanboliang/spark-12811.	2016-03-01 08:47:56 -08:00
Zheng RuiFeng	ac5c635281	[SPARK-13506][MLLIB] Fix the wrong parameter in R code comment in AssociationRulesSuite JIRA: https://issues.apache.org/jira/browse/SPARK-13506 ## What changes were proposed in this pull request? Just change the R snippet comment in AssociationRulesSuite. ## How was this patch tested? Unit tests passed. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11387 from zhengruifeng/ars.	2016-02-29 14:51:27 +00:00
Yanbo Liang	d81a71357e	[SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python ## What changes were proposed in this pull request? * The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.) * BTW, if we use a known updater(L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route. * Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly. cc mengxr dbtsai ## How was this patch tested? No new tests, it should pass all current tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11424 from yanboliang/spark-13545.	2016-02-29 00:55:51 -08:00
Bryan Cutler	b33261f913	[SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the tree module. closes #10601 Author: Bryan Cutler <cutlerb@gmail.com> Author: vijaykiran <mail@vijaykiran.com> Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.	2016-02-26 08:30:32 -08:00
Cheng Lian	99dfcedbfd	[SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This is another try of PR #11323. This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`. PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323. ## How was this patch tested? No extra tests are added. Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11388 from liancheng/remove-df-rdd-ops.	2016-02-27 00:28:30 +08:00
Yuhao Yang	90d07154c2	[SPARK-13028] [ML] Add MaxAbsScaler to ML.feature as a transformer jira: https://issues.apache.org/jira/browse/SPARK-13028 MaxAbsScaler works in a very similar way as MinMaxScaler, but scales in a way that the training data lies within the range [-1, 1] by dividing through the largest maximum value in each feature. The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data. Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity. Something similar from sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10939 from hhbyyh/maxabs and squashes the following commits: fd8bdcd [Yuhao Yang] add tag and some optimization on fit 648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs 75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs cb10bb6 [Yuhao Yang] remove minmax 91ef8f3 [Yuhao Yang] ut added 8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs a9215b5 [Yuhao Yang] max abs scaler	2016-02-25 21:04:35 -08:00
Yu ISHIKAWA	14e2700de2	[SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication ## What changes were proposed in this pull request? ML StringIndexer does not protect itself from column name duplication. We should still improve the way the schemas of `StringIndexer` and `StringIndexerModel` are validated. However, that would be better fixed in a separate issue. ## How was this patch tested? unit test Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11370 from yu-iskw/SPARK-12874.	2016-02-25 13:21:33 -08:00
Davies Liu	751724b132	Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations" This reverts commit `157fe64f3e`.	2016-02-25 11:53:48 -08:00
Cheng Lian	157fe64f3e	[SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`. ## How was this patch tested? No extra tests are added. Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11323 from liancheng/remove-df-rdd-ops.	2016-02-25 23:07:59 +08:00
Yanbo Liang	4460113d41	[SPARK-13490][ML] ML LinearRegression should cache standardization param value ## What changes were proposed in this pull request? Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` rather than re-fetching it from the ```ParamMap``` for every OWLQN iteration. cc srowen ## How was this patch tested? No extra tests are added. It should pass all existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11367 from yanboliang/spark-13490.	2016-02-25 13:34:29 +00:00
Oliver Pierson	6f8e835c68	[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames ## What changes were proposed in this pull request? Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count` ## How was this patch tested? Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected. Author: Oliver Pierson <ocp@gatech.edu> Author: Oliver Pierson <opierson@umd.edu> Closes #11319 from oliverpierson/SPARK-13444.	2016-02-25 13:24:46 +00:00
Xusen Yin	8d29001dec	[SPARK-13011] K-means wrapper in SparkR https://issues.apache.org/jira/browse/SPARK-13011 Author: Xusen Yin <yinxusen@gmail.com> Closes #11124 from yinxusen/SPARK-13011.	2016-02-23 15:42:58 -08:00
Grzegorz Chilkiewicz	5d69eaf097	[SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes #11216 from grzegorz-chilkiewicz/master.	2016-02-23 10:30:02 -08:00
Xiangrui Meng	764ca18037	[SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave Author: Xiangrui Meng <meng@databricks.com> Closes #11226 from mengxr/SPARK-13355.	2016-02-22 23:54:21 -08:00
Yanbo Liang	72427c3e11	[SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib ## What changes were proposed in this pull request? In order to provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6``` which will be equal to ML ```LogisticRegression```. cc dbtsai ## How was this patch tested? unit tests Author: Yanbo Liang <ybliang8@gmail.com> Closes #11299 from yanboliang/spark-13429.	2016-02-22 23:37:09 -08:00
Narine Kokhlikyan	33ef3aa7ea	[SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint) method a new array is being created for intercept value and it is being concatenated with another array which contains the betas, the resulted Array is being converted into a Dense vector which in its turn is being converted into breeze vector. This is expensive and not necessarily beautiful. I've tried to solve above mentioned problem by simple algebraic decompositions - keeping and treating intercept independently. Please let me know what do you think and if you have any questions. Thanks, Narine Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #11179 from NarineK/survivaloptim.	2016-02-22 17:26:32 -08:00
Yanbo Liang	40e6d40fe7	[SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #11214 from yanboliang/spark-13334.	2016-02-22 12:59:50 +02:00
Bryan Cutler	e298ac91e3	[SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules. Closes #10602 Closes #10897 Author: Bryan Cutler <cutlerb@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.	2016-02-22 12:48:37 +02:00
Dongjoon Hyun	024482bf51	[MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments ## What changes were proposed in this pull request? This PR tries to fix all typos in all markdown files under `docs` module, and fixes similar typos in other comments, too. ## How was this patch tested? manual tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11300 from dongjoon-hyun/minor_fix_typos.	2016-02-22 09:52:07 +00:00
Yong Gang Cao	ef1047fca7	[SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec Add support of arbitrary length sentence by using the natural representation of sentences in the input. Add new similarity functions and add normalization option for distances in synonym finding. Add new accessor for internal structure (the vocabulary and wordindex) for convenience. Need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3? jira link: https://issues.apache.org/jira/browse/SPARK-12153 Author: Yong Gang Cao <ygcao@amazon.com> Author: Yong-Gang Cao <ygcao@users.noreply.github.com> Closes #10152 from ygcao/improvementForSentenceBoundary.	2016-02-22 09:47:36 +00:00
Yanbo Liang	8a4ed78869	[SPARK-13379][MLLIB] Fix MLlib LogisticRegressionWithLBFGS set regularization incorrectly ## What changes were proposed in this pull request? Fix MLlib LogisticRegressionWithLBFGS regularization map as: ```SquaredL2Updater``` -> ```elasticNetParam = 0.0``` ```L1Updater``` -> ```elasticNetParam = 1.0``` cc dbtsai ## How was this patch tested? unit tests Author: Yanbo Liang <ybliang8@gmail.com> Closes #11258 from yanboliang/spark-13379.	2016-02-21 20:20:41 -08:00
Xiangrui Meng	0088b252bf	[MINOR][MLLIB] fix mllib compile warnings This PR fixes some warnings found by `build/sbt mllib/test:compile`. Author: Xiangrui Meng <meng@databricks.com> Closes #11227 from mengxr/fix-mllib-warnings-201602.	2016-02-17 18:56:19 -08:00
BenFradet	00c72d27bf	[SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general This documents the implementation of ALS in `spark.ml` with example code in scala, java and python. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10411 from BenFradet/SPARK-12247.	2016-02-16 13:03:28 +00:00
seddonm1	cbeb006f23	[SPARK-13097][ML] Binarizer allowing Double AND Vector input types This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type. A use case for this enhancement is for when a user wants to Binarize many similar feature columns at once using the same threshold value (for example a binary threshold applied to many pixels in an image). This contribution is my original work and I license the work to the project under the project's open source license. viirya mengxr Author: seddonm1 <seddonm1@gmail.com> Closes #10976 from seddonm1/master.	2016-02-15 20:15:27 -08:00
Liang-Chi Hsieh	e3441e3f68	[SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10539 from viirya/fix-poweriter.	2016-02-13 15:56:20 -08:00
Earthson Lu	5f1c359069	[SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) https://issues.apache.org/jira/browse/SPARK-12746 Author: Earthson Lu <Earthson.Lu@gmail.com> Closes #10697 from Earthson/SPARK-12746.	2016-02-11 18:31:46 -08:00
Liu Xiang	a5257048d7	[SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lost transformSchema https://issues.apache.org/jira/browse/SPARK-12765 Author: Liu Xiang <lxmtlab@gmail.com> Closes #10720 from sloth2012/sloth.	2016-02-11 17:28:37 -08:00
Yu ISHIKAWA	574571c870	[SPARK-11515][ML] QuantileDiscretizer should take random seed cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9535 from yu-iskw/SPARK-11515.	2016-02-11 15:05:34 -08:00
Yu ISHIKAWA	efb65e09bc	[SPARK-13265][ML] Refactoring of basic ML import/export for other file system besides HDFS jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11151 from yu-iskw/SPARK-13265.	2016-02-11 15:00:23 -08:00
Sasaki Toru	c2f21d8898	[SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template In spark-env.sh.template, there are multi-byte characters, this PR will remove it. Author: Sasaki Toru <sasakitoa@nttdata.co.jp> Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.	2016-02-11 09:30:36 +00:00
Liang-Chi Hsieh	9267bc68fa	[SPARK-10524][ML] Use the soft prediction to order categories' bins JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8734 from viirya/dt-soft-centroids.	2016-02-09 17:10:55 -08:00
Holden Karau	ce83fe9756	[SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator KMeans: Make a private non-deprecated version of setRuns API so that we can call it from the PythonAPI without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values MFDataGenerator: Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere. I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way. Author: Holden Karau <holden@us.ibm.com> Closes #11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.	2016-02-09 08:47:28 +00:00
Gary King	bc8890b357	[SPARK-13132][MLLIB] cache standardization param value in LogisticRegression cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-newton optimizer also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit. this change improves training times for one of my test sets from ~7m30s to ~4m30s Author: Gary King <gary@idibon.com> Closes #11027 from idigary/spark-13132-optimize-logistic-regression.	2016-02-07 09:13:28 +00:00
Imran Younus	0557146619	[SPARK-12732][ML] bug fix in linear regression train Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently. Author: Imran Younus <iyounus@us.ibm.com> Closes #10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.	2016-02-02 20:38:53 -08:00
Grzegorz Chilkiewicz	b1835d7272	[SPARK-12711][ML] ML StopWordsRemover does not protect itself from column name duplication Fixes problem and verifies fix by test suite. Also - adds optional parameter: nullable (Boolean) to: SchemaUtils.appendColumn and deduplicates SchemaUtils.appendColumn functions. Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes #10741 from grzegorz-chilkiewicz/master.	2016-02-02 11:16:24 -08:00
Bryan Cutler	cba1d6b659	[SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.	2016-02-02 10:50:22 -08:00
Josh Rosen	289373b28c	[SPARK-6363][BUILD] Make Scala 2.11 the default Scala version This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break. Author: Josh Rosen <joshrosen@databricks.com> Closes #10608 from JoshRosen/SPARK-6363.	2016-01-30 00:20:28 -08:00
Yanbo Liang	df78a934a0	[SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver Implement ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```. There are two limitations in the current implementation compared with R: * It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code: ``` glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial) ``` * It does not support ```offset```. Because ```RFormula``` does not support ```Tuple``` as label or the ```offset``` keyword, I simplified the implementation. Adding support for these two functions is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add R-like statistic summary for IRLS. The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM). Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10639 from yanboliang/spark-9835.	2016-01-28 14:29:47 -08:00
Holden Karau	b72611f20a	[SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API. Note that both of them are doing feature scalings to improve the convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review. Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.	2016-01-26 17:59:05 -08:00
Jeff Zhang	1dac964c1b	[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… … Add LibSVMOutputWriter The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter * Partition is still not supported * Multiple input paths are not supported Author: Jeff Zhang <zjffdu@apache.org> Closes #9595 from zjffdu/SPARK-11622.	2016-01-26 17:31:19 -08:00
Xusen Yin	fbf7623d49	[SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer other than its parent class https://issues.apache.org/jira/browse/SPARK-12952 Author: Xusen Yin <yinxusen@gmail.com> Closes #10863 from yinxusen/SPARK-12952.	2016-01-26 13:18:01 -08:00
Xusen Yin	ae47ba718a	[SPARK-12834] Change ser/de of JavaArray and JavaList https://issues.apache.org/jira/browse/SPARK-12834 We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes #10772 from yinxusen/SPARK-12834.	2016-01-25 22:41:52 -08:00
Yanbo Liang	dcae355c64	[SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark ```PCAModel``` can output ```explainedVariance``` at Python side. cc mengxr srowen Author: Yanbo Liang <ybliang8@gmail.com> Closes #10830 from yanboliang/spark-12905.	2016-01-25 13:54:21 -08:00
Yanbo Liang	dd2325d9a7	[SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10222 from yanboliang/spark-11965.	2016-01-25 11:52:26 -08:00
Shixiong Zhu	bc1babd63d	[SPARK-7997][CORE] Remove Akka from Spark Core and Streaming - Remove Akka dependency from core. Note: the streaming-akka project still uses Akka. - Remove HttpFileServer - Remove Akka configs from SparkConf and SSLOptions - Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it. - Update comments and docs Author: Shixiong Zhu <shixiong@databricks.com> Closes #10854 from zsxwing/remove-akka.	2016-01-22 21:20:04 -08:00
DB Tsai	b4574e387d	[SPARK-12908][ML] Add warning message for LogisticRegression for potential converge issue When all labels are the same, it's a dangerous ground for LogisticRegression without intercept to converge. GLMNET doesn't support this case, and will just exit. GLM can train, but will have a warning message saying the algorithm doesn't converge. Author: DB Tsai <dbt@netflix.com> Closes #10862 from dbtsai/add-tests.	2016-01-21 17:24:48 -08:00
Takahashi Hiroshi	e3727c409f	[SPARK-10263][ML] Add @Since annotation to ml.param and ml.* Add Since annotations to ml.param and ml.* Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8935 from taishi-oss/issue10263.	2016-01-20 11:44:04 -08:00
Imran Younus	9753835cf3	[SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero. This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train. Author: Imran Younus <iyounus@us.ibm.com> Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.	2016-01-20 11:16:59 -08:00
Yu ISHIKAWA	9376ae723e	[SPARK-6519][ML] Add spark.ml API for bisecting k-means Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9604 from yu-iskw/SPARK-6519.	2016-01-20 10:48:10 -08:00
BenFradet	f6f7ca9d2e	[SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10472 from BenFradet/SPARK-9716.	2016-01-19 14:59:20 -08:00
Feynman Liang	2388de5191	[SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data CC jkbradley mengxr dbtsai Author: Feynman Liang <feynman.liang@gmail.com> Closes #10743 from feynmanliang/SPARK-12804.	2016-01-19 11:08:52 -08:00
Holden Karau	0ddba6d88f	[SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans. Author: Holden Karau <holden@us.ibm.com> Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.	2016-01-19 10:15:54 -08:00
Wojciech Jurczyk	ebd9ce0f1f	[MLLIB] Fix CholeskyDecomposition assertion's message Change assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, where in fact it was lapack.dppsv method. Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com> Closes #10818 from wjur/wjur/rename_error_message.	2016-01-19 09:36:45 +00:00
Eric Liang	5e492e9d5b	[SPARK-12346][ML] Missing attribute names in GLM for vector-type features Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names. cc mengxr Author: Eric Liang <ekl@databricks.com> Closes #10323 from ericl/spark-12346.	2016-01-18 12:50:58 -08:00
Tommy YU	233d6cee96	[SPARK-10264][DOCUMENTATION] Added @Since to ml.recommendation I created a new PR since the original PR has not been updated for a long time. Please help to review. srowen Author: Tommy YU <tummyyu@163.com> Closes #10756 from Wenpei/add_since_to_recomm.	2016-01-18 13:46:14 +00:00
Reynold Xin	fe7246fea6	[SPARK-12830] Java style: disallow trailing whitespaces. Author: Reynold Xin <rxin@databricks.com> Closes #10764 from rxin/SPARK-12830.	2016-01-14 23:33:45 -08:00
Yuhao Yang	021dafc6a0	[SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large jira: https://issues.apache.org/jira/browse/SPARK-12026 The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested on local and the change can improve the performance and the running time was stable. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10146 from hhbyyh/chiSq.	2016-01-13 17:43:27 -08:00
Sean Owen	c48f2a3a5f	[SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero Cosine similarity with 0 vector should be 0 Related to https://github.com/apache/spark/pull/10152 Author: Sean Owen <sowen@cloudera.com> Closes #10696 from srowen/SPARK-7615.	2016-01-12 11:50:33 +00:00
Yuhao Yang	bbea88852c	[SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel jira: https://issues.apache.org/jira/browse/SPARK-10809 We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents. add some missing assert too. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9484 from hhbyyh/ldaTopicPre.	2016-01-11 14:55:44 -08:00
Yuhao Yang	4f8eefa36b	[SPARK-12685][MLLIB] word2vec trainWordsCount gets overflow jira: https://issues.apache.org/jira/browse/SPARK-12685 the log of `word2vec` reports trainWordsCount = -785727483 during computation over a large dataset. Update the priority as it will affect the computation process. `alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))` Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10627 from hhbyyh/w2voverflow.	2016-01-11 14:48:35 -08:00
Yanbo Liang	ee4ee02b86	[SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala does. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10552 from yanboliang/spark-12603.	2016-01-11 14:43:25 -08:00
Marcelo Vanzin	6439a82503	[SPARK-3873][BUILD] Enable import ordering error checking. Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that cropped in since my recent cleanups. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10612 from vanzin/SPARK-3873-enable.	2016-01-10 20:04:50 -08:00
Kousuke Saruta	e5904bb5e7	[SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":") Fix the style violation (space before , and :). This PR is a followup for #10643. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10684 from sarutak/SPARK-12692-followup-mllib.	2016-01-10 12:38:57 -08:00
Sean Owen	b9c8353378	[SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.	2016-01-08 17:47:44 +00:00
Robert Dodier	6b6d02be0d	[SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html) Author: Robert Dodier <robert_dodier@users.sourceforge.net> Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.	2016-01-06 19:49:10 -08:00
BenFradet	f82ebb1522	[SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC". Also, in the documentation, it is said that: "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10328 from BenFradet/SPARK-12368.	2016-01-06 12:01:05 -08:00
Marcelo Vanzin	b3ba1be3b7	[SPARK-3873][TESTS] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.	2016-01-05 19:07:39 -08:00
RJ Nowling	78015a8b7c	[SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans SPARK-12450 . Un-persist broadcasted variables in KMeans. Author: RJ Nowling <rnowling@gmail.com> Closes #10415 from rnowling/spark-12450.	2016-01-05 15:05:04 -08:00
Yanbo Liang	13a3b636d9	[SPARK-6724][MLLIB] Support model save/load for FPGrowthModel Support model save/load for FPGrowthModel Author: Yanbo Liang <ybliang8@gmail.com> Closes #9267 from yanboliang/spark-6724.	2016-01-05 13:31:59 -08:00
Imran Younus	1cdc42d2b9	[SPARK-12331][ML] R^2 for regression through the origin. Modified the definition of R^2 for regression through origin. Added modified test for regression metrics. Author: Imran Younus <iyounus@us.ibm.com> Author: Imran Younus <imranyounus@gmail.com> Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.	2016-01-05 11:48:45 +00:00
Yanbo Liang	93ef9b6a2a	[SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction DecisionTreeRegressor will provide variance of prediction as a Double column. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8866 from yanboliang/spark-9622.	2016-01-04 13:32:14 -08:00
Yanbo Liang	ba5f81859d	[SPARK-11259][ML] Params.validateParams() should be called automatically See JIRA: https://issues.apache.org/jira/browse/SPARK-11259 Author: Yanbo Liang <ybliang8@gmail.com> Closes #9224 from yanboliang/spark-11259.	2016-01-04 13:30:17 -08:00
Reynold Xin	513e3b092c	[SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduced a new API for that, and replaces the invocation of the deprecated callUDF with that. Author: Reynold Xin <rxin@databricks.com> Closes #10547 from rxin/SPARK-12599.	2016-01-02 22:31:39 -08:00
Marcelo Vanzin	a59a357cae	[SPARK-3873][MLLIB] Import order fixes. A slight adjustment to the checker configuration was needed; there is a handful of warnings still left, but those are because of a bug in the checker that I'll fix separately (before enabling errors for the checker, of course). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10535 from vanzin/SPARK-3873-mllib.	2015-12-31 23:48:55 -08:00
Sean Owen	be86268eb5	[SPARK-12349][SPARK-12349][ML] Fix typo in Spark version regex introduced in / PR 10327 Sorry jkbradley Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942 Author: Sean Owen <sowen@cloudera.com> Closes #10508 from srowen/SPARK-12349.2.	2015-12-29 16:32:26 -08:00
Shixiong Zhu	710b411729	[SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs Include the following changes: 1. Close `java.sql.Statement` 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10440 from zsxwing/findbugs.	2015-12-28 15:01:51 -08:00
Kousuke Saruta	07165ca06f	[SPARK-12424][ML] The implementation of ParamMap#filter is wrong. ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`. Also, the return type of Map#filterKeys is not Serializable. This is a Scala issue (https://issues.scala-lang.org/browse/SI-6654). Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10381 from sarutak/SPARK-12424.	2015-12-29 05:33:19 +09:00
Kazuaki Ishizaki	3920466118	[SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property Restore the original value of the os.arch property after each test. Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10289 from kiszk/SPARK-12311.	2015-12-24 13:37:28 +00:00
Sean Owen	d0f695089e	[SPARK-12349][ML] Make spark.ml PCAModel load backwards compatible Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x jkbradley is this kind of what you had in mind? Author: Sean Owen <sowen@cloudera.com> Closes #10327 from srowen/SPARK-12349.	2015-12-21 10:21:22 +00:00
Bryan Cutler	ce1798b3af	[SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647." Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.	2015-12-20 09:08:23 +00:00
Reynold Xin	f496031bd2	Bump master version to 2.0.0-SNAPSHOT. Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.	2015-12-19 15:13:05 -08:00
Yanbo Liang	d252b2d544	[SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating new one for spark.ml test suites. I have checked thoroughly and found there are four test cases need to update. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10279 from yanboliang/spark-12309.	2015-12-16 11:07:54 -08:00
Yanbo Liang	860dc7f2f8	[SPARK-9694][ML] Add random seed Param to Scala CrossValidator Add random seed Param to Scala CrossValidator Author: Yanbo Liang <ybliang8@gmail.com> Closes #9108 from yanboliang/spark-9694.	2015-12-16 11:05:37 -08:00
Liang-Chi Hsieh	b51a4cdff3	[SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark JIRA: https://issues.apache.org/jira/browse/SPARK-12016 We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10100 from viirya/fix-load-py-wordvecmodel.	2015-12-14 09:59:42 -08:00
Mike Dusenberry	1b8220387e	[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.	2015-12-11 14:21:33 -08:00
Holden Karau	518ab51010	[SPARK-10991][ML] logistic regression training summary handle empty prediction col LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718 ) Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.	2015-12-11 02:35:53 -05:00
Yuhao Yang	9fba9c8004	[SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit jira: https://issues.apache.org/jira/browse/SPARK-11602 Made a pass on the API change of 1.6. Open the PR for efficient discussion. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9939 from hhbyyh/auditScala.	2015-12-10 10:15:50 -08:00
Sean Owen	21b3d2a75f	[SPARK-11530][MLLIB] Return eigenvalues with PCA model Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance. CC mengxr Author: Sean Owen <sowen@cloudera.com> Closes #9736 from srowen/SPARK-11530.	2015-12-10 14:05:45 +00:00
Holden Karau	22b9a8740d	[SPARK-10299][ML] word2vec should allow users to specify the window size Currently word2vec has the window hard coded at 5, some users may want different sizes (for example if using on n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.	2015-12-09 16:45:13 +00:00
Dominik Dahlem	a0046e379b	[SPARK-11343][ML] Documentation of float and double prediction/label columns in RegressionEvaluator felixcheung , mengxr Just added a message to require() Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.	2015-12-08 18:54:10 -08:00
Yuhao Yang	5cb4695051	[SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs jira: https://issues.apache.org/jira/browse/SPARK-11605 Check Java compatibility for MLlib for this release. fix: 1. `StreamingTest.registerStream` needs java friendly interface. 2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` has java compatibility issue. Mark them as `developerAPI`. TBD: [updated] no fix for now per discussion. `org.apache.spark.mllib.classification.LogisticRegressionModel` `public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation. `SVMModel` has the similar issue. Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with different name seems to be not necessary. cc jkbradley feynmanliang Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10102 from hhbyyh/javaAPI.	2015-12-08 11:46:26 -08:00
Nakul Jindal	037b7e76a7	[SPARK-11439][ML] Optimization of creating sparse feature without dense one Sparse feature generated in LinearDataGenerator does not create dense vectors as an intermediate any more. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.	2015-12-08 11:08:27 +00:00
Yanbo Liang	4a39b5a1be	[SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10006 from yanboliang/spark-11958.	2015-12-07 23:50:57 -08:00
Takahashi Hiroshi	7d05a62451	[SPARK-10259][ML] Add @since annotation to ml.classification Add since annotation to ml.classification Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8534 from taishi-oss/issue10259.	2015-12-07 23:46:55 -08:00
Joseph K. Bradley	3e7e05f5ee	[SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods. This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml. CC: mengxr yhuai Author: Joseph K. Bradley <joseph@databricks.com> Closes #10161 from jkbradley/mllib-sqlcontext-fix.	2015-12-07 16:37:09 -08:00
Sean Owen	7da6748519	[SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7 Update JPMML pmml-model to 1.2.7 Author: Sean Owen <sowen@cloudera.com> Closes #9972 from srowen/SPARK-11988.	2015-12-05 15:52:52 +00:00
Antonio Murgia	e9c9ae22b9	[SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Author: Antonio Murgia <antonio.murgia2@studio.unibo.it> Closes #9989 from tmnd1991/SPARK-11932.	2015-12-05 15:42:02 +00:00
Yuhao Yang	ee94b70ce5	[SPARK-12096][MLLIB] remove the old constraint in word2vec jira: https://issues.apache.org/jira/browse/SPARK-12096 word2vec can now handle a much bigger vocabulary. The old constraint vocabSize.toLong * vectorSize < Int.max / 8 should be removed. The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue). I tested with vocabsize over 18M and vectorsize = 100. srowen jkbradley Sorry to miss this in last PR. I was reminded today. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10103 from hhbyyh/w2vCapacity.	2015-12-05 15:27:31 +00:00
Josh Rosen	b7204e1d41	[SPARK-12112][BUILD] Upgrade to SBT 0.13.9 We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.	2015-12-05 08:15:30 +08:00
Dmitry Erastov	d0d8222778	[SPARK-6990][BUILD] Add Java linting script; fix minor warnings This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles): > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1 Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell > I state that the contribution is my original work, and I license the work to the project under the project's open source license. Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.	2015-12-04 12:03:45 -08:00
Xiangrui Meng	9bb695b7a8	[SPARK-12000] do not specify arg types when reference a method in ScalaDoc This fixes SPARK-12000, verified on my local with JDK 7. It seems that `scaladoc` try to match method names and messed up with annotations. cc: JoshRosen jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #10114 from mengxr/SPARK-12000.2.	2015-12-02 17:19:31 -08:00
Yu ISHIKAWA	de07d06abe	[SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tuning cc mengxr noel-smith I worked on this issue based on https://github.com/apache/spark/pull/8729. ehsanmok thank you for your contribution! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #9338 from yu-iskw/JIRA-10266.	2015-12-02 14:15:54 -08:00
Cheng Lian	69dbe6b40d	[SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues This PR backports PR #10039 to master Author: Cheng Lian <lian@databricks.com> Closes #10063 from liancheng/spark-12046.doc-fix.master.	2015-12-01 10:21:31 -08:00
Yuhao Yang	a0af0e351e	[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization. Using broadcast can greatly improve the performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can help, 1. decrease the worker memory consumption by 45%. 2. decrease running time by 40%. This will also help extend the upper limit for Word2Vec. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9878 from hhbyyh/w2vBC.	2015-12-01 09:26:58 +00:00
Yuhao Yang	52bc25c8e2	[SPARK-11847][ML] Model export/import for spark.ml: LDA Add read/write support to LDA, similar to ALS. save/load for ml.LocalLDAModel is done. For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send update after some test. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9894 from hhbyyh/ldaMLsave.	2015-11-24 09:56:17 -08:00
Joseph K. Bradley	9e24ba667e	[SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col Doc for 1.6 that the summaries mostly ignore the weight column. To be corrected for 1.7 CC: mengxr thunterdb Author: Joseph K. Bradley <joseph@databricks.com> Closes #9927 from jkbradley/linregsummary-doc.	2015-11-24 09:54:55 -08:00
BenFradet	4be360d4ee	[SPARK-11902][ML] Unhandled case in VectorAssembler#transform There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType or VectorUDT. So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType". This PR aims to fix this, throwing a SparkException when dealing with an unknown column type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #9885 from BenFradet/SPARK-11902.	2015-11-22 22:05:01 -08:00
Yanbo Liang	d9cf9c21fc	[SPARK-11912][ML] ml.feature.PCA minor refactor Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is params and we should save it under ```metadata/``` rather than both under ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` but construct ```mllib.feature.PCAModel``` inside ```transform```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9897 from yanboliang/spark-11912.	2015-11-22 21:56:07 -08:00
Joseph K. Bradley	a6fda0bfc1	[SPARK-6791][ML] Add read/write for CrossValidator and Evaluators I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.) Added read/write for all 3 Evaluators as well. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9848 from jkbradley/cv-io.	2015-11-22 21:48:48 -08:00
Yanbo Liang	9ace2e5c8d	[SPARK-11852][ML] StandardScaler minor refactor ```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9839 from yanboliang/standardScaler-refactor.	2015-11-20 09:55:53 -08:00
Xusen Yin	3e1d120ced	[SPARK-11867] Add save/load for kmeans and naive bayes https://issues.apache.org/jira/browse/SPARK-11867 Author: Xusen Yin <yinxusen@gmail.com> Closes #9849 from yinxusen/SPARK-11867.	2015-11-19 23:43:18 -08:00
Joseph K. Bradley	0fff8eb3e4	[SPARK-11869][ML] Clean up TempDirectory properly in ML tests Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```) I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem. CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting. Author: Joseph K. Bradley <joseph@databricks.com> Closes #9851 from jkbradley/tempdir-cleanup.	2015-11-19 23:42:24 -08:00
Yanbo Liang	3b7f056da8	[SPARK-11829][ML] Add read/write to estimators under ml.feature (II) Add read/write support to the following estimators under spark.ml: * ChiSqSelector * PCA * VectorIndexer * Word2Vec Author: Yanbo Liang <ybliang8@gmail.com> Closes #9838 from yanboliang/spark-11829.	2015-11-19 22:02:17 -08:00
Xusen Yin	4114ce20fb	[SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression https://issues.apache.org/jira/browse/SPARK-11846 mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #9836 from yinxusen/SPARK-11846.	2015-11-19 22:01:02 -08:00
Joseph K. Bradley	d02d5b9295	[SPARK-11842][ML] Small cleanups to existing Readers and Writers Updates: * Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression * Switch from hand-written class names in Readers to using getClass CC: mengxr CC: yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you! Author: Joseph K. Bradley <joseph@databricks.com> Closes #9829 from jkbradley/ml-io-cleanups.	2015-11-18 21:44:01 -08:00
Xiangrui Meng	e99d339206	[SPARK-11839][ML] refactor save/write traits * add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.* * define `DefaultParamsReadable/Writable` and use them to save some code * use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9827 from mengxr/SPARK-11839.	2015-11-18 18:34:01 -08:00
Xiangrui Meng	7e987de177	[SPARK-6787][ML] add read/write to estimators under ml.feature (1) Add read/write support to the following estimators under spark.ml: * CountVectorizer * IDF * MinMaxScaler * StandardScaler (a little awkward because we store some params in spark.mllib model) * StringIndexer Added some necessary method for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9798 from mengxr/SPARK-6787.	2015-11-18 15:47:49 -08:00
Yanbo Liang	e222d75849	[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes This PR includes: * Update SparkR:::glm, SparkR:::summary API docs. * Update SparkR machine learning user guide and example codes to show: * supporting feature interaction in R formula. * summary for gaussian GLM model. * coefficients for binomial GLM model. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9727 from yanboliang/spark-11684.	2015-11-18 13:30:29 -08:00
Yuhao Yang	e391abdf2c	[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits. 1. Performance improvement from less serialization. 2. Increases the capacity of Word2Vec a lot. Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab of size: vocab * 40 * 2 * 4 = 320 vocab. The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, and thus allows a larger vocabulary. Actually there's another possible fix: make a local copy of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9803 from hhbyyh/w2vVocab.	2015-11-18 13:25:15 -08:00
Joseph K. Bradley	2acdf10b1f	[SPARK-6789][ML] Add Readable, Writable support for spark.ml ALS, ALSModel Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9786 from jkbradley/als-io.	2015-11-18 13:16:31 -08:00
Wenjian Huang	045a4f0458	[SPARK-6790][ML] Add spark.ml LinearRegression import/export This replaces [https://github.com/apache/spark/pull/9656] with updates. fayeshine should be the main author when this PR is committed. CC: mengxr fayeshine Author: Wenjian Huang <nextrush@163.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #9814 from jkbradley/fayeshine-patch-6790.	2015-11-18 13:06:25 -08:00
RoyGaoVLIS	67a5132c21	[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler I have added unit test for ML's StandardScaler By comparing with R's output, please review for me. Thx. Author: RoyGaoVLIS <roygao@zju.edu.cn> Closes #6665 from RoyGao/7013.	2015-11-17 23:00:49 -08:00
Xiangrui Meng	3e9e638023	[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9776 from mengxr/SPARK-11764.	2015-11-17 14:04:49 -08:00
Joseph K. Bradley	6eb7008b7f	[SPARK-11763][ML] Add save,load to LogisticRegression Estimator Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9749 from jkbradley/lr-io-2.	2015-11-17 14:03:49 -08:00
Joseph K. Bradley	d98d1cb000	[SPARK-11769][ML] Add save, load to all basic Transformers This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds: * Bucketizer * DCT * HashingTF * Interaction * NGram * Normalizer * OneHotEncoder * PolynomialExpansion * QuantileDiscretizer * RFormula * SQLTransformer * StopWordsRemover * StringIndexer * Tokenizer * VectorAssembler * VectorSlicer CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9755 from jkbradley/transformer-io.	2015-11-17 12:43:56 -08:00
Xiangrui Meng	21fac54341	[SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9751 from mengxr/SPARK-11766.	2015-11-17 10:17:16 -08:00
Joseph K. Bradley	1c5475f140	[SPARK-11612][ML] Pipeline and PipelineModel persistence Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9674 from jkbradley/pipeline-io.	2015-11-16 17:12:39 -08:00
Xiangrui Meng	64e5551103	[SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite The same as #9694, but for Java test suite. yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9719 from mengxr/SPARK-11672.4.	2015-11-15 13:23:05 -08:00
Xiangrui Meng	bdfbc1dcaf	[MINOR][ML] remove MLlibTestsSparkContext from ImpuritySuite ImpuritySuite doesn't need SparkContext. Author: Xiangrui Meng <meng@databricks.com> Closes #9698 from mengxr/remove-mllib-test-context-in-impurity-suite.	2015-11-13 13:19:04 -08:00
Xiangrui Meng	2d2411faa2	[SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll Still saw some error messages caused by `SQLContext.getOrCreate`: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/ This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9694 from mengxr/SPARK-11672.3.	2015-11-13 13:09:28 -08:00
Yanbo Liang	99693fef0a	[SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include: * Use libSVM data source for all example codes under examples/ml, and remove unused import. * Use libSVM data source for user guides under ml-*** which were omitted by #8697. * Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```. * Code cleanup. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9690 from yanboliang/spark-11723.	2015-11-13 08:43:05 -08:00
Xiangrui Meng	e71c07557c	[SPARK-11672][ML] flaky spark.ml read/write tests We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext` and then `SQLContext.getOrCreate` might use the `SparkContext` from previous test suite and hence causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9677 from mengxr/SPARK-11672.2.	2015-11-12 20:01:13 -08:00
Joseph K. Bradley	dcb896fd8c	[SPARK-11712][ML] Make spark.ml LDAModel be abstract Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases. CC feynmanliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9678 from jkbradley/lda-pipelines-2.	2015-11-12 17:03:19 -08:00
Xiangrui Meng	e2957bc085	[SPARK-11674][ML] add private val after @transient in Word2VecModel This causes compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested compile locally.) JoshRosen Author: Xiangrui Meng <meng@databricks.com> Closes #9644 from mengxr/SPARK-11674.	2015-11-11 21:01:14 -08:00
Xiangrui Meng	1a21be15f6	[SPARK-11672][ML] disable spark.ml read/write tests Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/ I cannot reproduce it locally, so this PR temporarily disables the tests and I will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile. Author: Xiangrui Meng <meng@databricks.com> Closes #9641 from mengxr/SPARK-11672.	2015-11-11 15:41:36 -08:00
Yuming Wang	27524a3a9c	[SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow org.apache.spark.ml.feature.Word2Vec.transform() is very slow. We should not read the broadcast variable for every sentence. Author: Yuming Wang <q79969786@gmail.com> Author: yuming.wang <q79969786@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #9592 from 979969786/master.	2015-11-11 09:43:26 -08:00
Joseph K. Bradley	6e101d2e9d	[SPARK-6726][ML] Import/export for spark.ml LogisticRegressionModel This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9606 from jkbradley/logreg-io2.	2015-11-10 18:45:48 -08:00
Yu ISHIKAWA	c0e48dfa61	[SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in Python cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9534 from yu-iskw/SPARK-11566.	2015-11-10 16:42:28 -08:00
Joseph K. Bradley	e281b87398	[SPARK-5565][ML] LDA wrapper for Pipelines API This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change: * I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed. Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR. CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9513 from jkbradley/lda-pipelines.	2015-11-10 16:20:10 -08:00
unknown	dba1a62cf1	[SPARK-7316][MLLIB] RDD sliding window with step Implementation of step capability for sliding window function in MLlib's RDD. Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points: Window | Step | Time | Windows produced ------------ | ------------- | ---------- | ---------- 128 | 1 | 6.38 | 9999873 128 | 10 | 0.9 | 999988 128 | 100 | 0.41 | 99999 1024 | 1 | 44.67 | 9998977 1024 | 10 | 4.74 | 999898 1024 | 100 | 0.78 | 99990 ``` import org.apache.spark.mllib.rdd.RDDFunctions._ val rdd = sc.parallelize(1 to 10000000, 10) rdd.count val window = 1024 val step = 1 val t = System.nanoTime(); val windows = rdd.sliding(window, step); println(windows.count); println((System.nanoTime() - t) / 1e9) ``` Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net> Author: Alexander Ulanov <nashb@yandex.ru> Author: Xiangrui Meng <meng@databricks.com> Closes #5855 from avulanov/SPARK-7316-sliding.	2015-11-10 14:25:06 -08:00
Joseph K. Bradley	18350a5700	[SPARK-11618][ML] Minor refactoring of basic ML import/export Refactoring * separated overwrite and param save logic in DefaultParamsWriter * added sparkVersion to DefaultParamsWriter CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9587 from jkbradley/logreg-io.	2015-11-10 11:36:43 -08:00
Yuhao Yang	61f9c8711c	[SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase jira: https://issues.apache.org/jira/browse/SPARK-11069 quotes from jira: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal: call the Boolean Param "toLowercase" set default to false (so behavior does not change) Actually sklearn converts to lowercase before tokenizing too Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9092 from hhbyyh/tokenLower.	2015-11-09 16:55:23 -08:00
Yu ISHIKAWA	8a2336893a	[SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation and spark.ml APIs. I am going to send further PRs later. https://issues.apache.org/jira/browse/SPARK-6517 - This implementation is based on bisecting K-means clustering. - It derives from freeman-lab's implementation - The basic idea is not changed from the previous version. (#2906) - However, it is 1000x faster than the previous version through parallel processing. Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen). Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com> Closes #5267 from yu-iskw/new-hierarchical-clustering.	2015-11-09 14:56:36 -08:00
fazlan-nazeem	9b88e1dcad	[SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model The current pmml models generated do not specify the pmml version in its root node. This is a problem when using this pmml model in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2. Author: fazlan-nazeem <fazlann@wso2.com> Closes #9558 from fazlan-nazeem/master.	2015-11-09 08:58:55 -08:00
Yanbo Liang	8c0e1b50e9	[SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like ```Java $DevianceResiduals Min Max -0.9509607 0.7291832 $Coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 1.6765 0.2353597 7.123139 4.456124e-11 Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12 Species_versicolor -0.9833885 0.07207471 -13.64402 0 Species_virginica -1.00751 0.09330565 -10.79796 0 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #9561 from yanboliang/spark-11494.	2015-11-09 08:56:22 -08:00
Yu ISHIKAWA	2ff0e79a86	[SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python Could jkbradley and davies review it? - Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it. - Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`. [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.	2015-11-06 22:56:29 -08:00
Xiangrui Meng	c447c9d546	[SPARK-11217][ML] save/load for non-meta estimators and transformers This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes: * class name * uid * timestamp * paramMap The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases. ~~~scala instance.save("path") instance.write.context(sqlContext).overwrite().save("path") Instance.load("path") ~~~ The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it also causes other issues if we modify the default params. TODOs: * [x] Java test * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9454 from mengxr/SPARK-11217.	2015-11-06 14:51:03 -08:00
Imran Rashid	49f1a82037	[SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.	2015-11-06 20:06:24 +00:00
Yu ISHIKAWA	8fa8c8375d	[SPARK-11514][ML] Pass random seed to spark.ml DecisionTree* cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9486 from yu-iskw/SPARK-11514.	2015-11-05 17:59:01 -08:00
Ehsan M.Kermani	f80f7b69a3	[SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression Here is my first commit. Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #8728 from ehsanmok/SinceAnn.	2015-11-05 12:11:57 -08:00
Yanbo Liang	9da7ceed81	[SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver Follow up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9485 from yanboliang/spark-11473.	2015-11-05 09:56:18 -08:00
a1singh	a94671a027	[SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation In file LDAOptimizer.scala: line 441: since "idx" was never used, replaced unrequired zipWithIndex.foreach with foreach. - nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) => + nonEmptyDocs.foreach { case (_, termCounts: Vector) => Author: a1singh <a1singh@ucsd.edu> Closes #9456 from a1singh/master.	2015-11-05 12:51:10 +00:00
Yu ISHIKAWA	411ff6afb4	[SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9469 from yu-iskw/SPARK-10028.	2015-11-04 15:28:19 -08:00
Yanbo Liang	e328b69c31	[SPARK-9492][ML][R] LogisticRegression in R should provide model statistics Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9303 from yanboliang/spark-9492.	2015-11-04 08:28:33 -08:00
Yanbo Liang	f54ff19b1e	[SPARK-11349][ML] Support transform string label for RFormula Currently ```RFormula``` can only handle label with ```NumericType``` or ```BinaryType``` (cast it to ```DoubleType``` as the label of Linear Regression training), we should also support label of ```StringType``` which is needed for Logistic Regression (glm with family = "binomial"). For label of ```StringType```, we should use ```StringIndexer``` to transform it to 0-based index. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9302 from yanboliang/spark-11349.	2015-11-03 08:32:37 -08:00
Yanbo Liang	3434572b14	[MINOR][ML] Fix naming conventions of AFTSurvivalRegression coefficients Rename ```regressionCoefficients``` back to ```coefficients```, and name ```weights``` to ```parameters```. See discussion [here](https://github.com/apache/spark/pull/9311/files#diff-e277fd0bc21f825d3196b4551c01fe5fR230). mengxr vectorijk dbtsai Author: Yanbo Liang <ybliang8@gmail.com> Closes #9431 from yanboliang/aft-coefficients.	2015-11-03 08:31:16 -08:00
Yanbo Liang	d6f10aa7ea	[SPARK-9836][ML] Provide R-like summary statistics for OLS via normal equation solver https://issues.apache.org/jira/browse/SPARK-9836 Author: Yanbo Liang <ybliang8@gmail.com> Closes #9413 from yanboliang/spark-9836.	2015-11-03 08:29:07 -08:00
DB Tsai	21ad846238	[MINOR][ML] removed the old `getModelWeights` function Removed the old `getModelWeights` function which was private and renamed into `getModelCoefficients` Author: DB Tsai <dbt@netflix.com> Closes #9426 from dbtsai/feature-minor.	2015-11-02 19:07:31 -08:00
vectorijk	c020f7d9d4	[SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models Deprecated in `LogisticRegression` and `LinearRegression` Author: vectorijk <jiangkai@gmail.com> Closes #9311 from vectorijk/spark-10592.	2015-11-02 16:12:04 -08:00
Dominik Dahlem	ec03866a7e	[SPARK-11343][ML] Allow float and double prediction/label columns in RegressionEvaluator mengxr, felixcheung This pull request just relaxes the type of the prediction/label columns to be float and double. Internally, these columns are casted to double. The other evaluators might need to be changed also. Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.	2015-11-02 16:11:42 -08:00
Xiangrui Meng	33ae7a35da	[SPARK-11358][MLLIB] deprecate runs in k-means This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation. cc: srowen Author: Xiangrui Meng <meng@databricks.com> Closes #9322 from mengxr/SPARK-11358.	2015-11-02 13:42:16 -08:00
Yu ISHIKAWA	e963070c13	[SPARK-9722] [ML] Pass random seed to spark.ml DecisionTree* Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9402 from yu-iskw/SPARK-9722.	2015-11-01 23:52:50 -08:00
Nakul Jindal	69b9e4b3c2	[SPARK-11385] [ML] foreachActive made public in MLLib's vector API Made foreachActive public in MLLib's vector API Author: Nakul Jindal <njindal@us.ibm.com> Closes #9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.	2015-10-30 17:12:24 -07:00
Lewuathe	86d65265fc	[SPARK-11207] [ML] Add test cases for solver selection of LinearRegression as followup. This is the follow up work of SPARK-10668. * Fix minor style issues. * Add test case for checking whether solver is selected properly. Author: Lewuathe <lewuathe@me.com> Author: lewuathe <lewuathe@me.com> Closes #9180 from Lewuathe/SPARK-11207.	2015-10-30 02:59:05 -07:00
Yanbo Liang	fba9e95452	[SPARK-11369][ML][R] SparkR glm should support setting standardize SparkR glm currently supports: ```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```. We should also support setting standardize, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit) Author: Yanbo Liang <ybliang8@gmail.com> Closes #9331 from yanboliang/spark-11369.	2015-10-28 08:50:21 -07:00
Nakul Jindal	5f1cee6f15	[SPARK-11332] [ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.	2015-10-28 01:02:03 -07:00
Xiangrui Meng	82c1c57728	[MINOR][ML] fix compile warns This fixes some compile time warnings. Author: Xiangrui Meng <meng@databricks.com> Closes #9319 from mengxr/mllib-compile-warn-20151027.	2015-10-27 23:41:42 -07:00
Sean Owen	826e1e304b	[SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2.	2015-10-27 23:07:37 -07:00
Reza Zadeh	8b292b19c9	[SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix. With a test. Author: Reza Zadeh <reza@databricks.com> Closes #8792 from rezazadeh/colsims.	2015-10-26 22:00:24 -07:00
Sean Owen	3cac6614a4	[SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier Author: Sean Owen <sowen@cloudera.com> Closes #9169 from srowen/SPARK-11184.	2015-10-26 21:47:42 -07:00
Jayant Shekar	4e38defae1	[SPARK-6723] [MLLIB] Model import/export for ChiSqSelector This is a PR for Parquet-based model import/export. * Added save/load for ChiSqSelectorModel * Updated the test suite ChiSqSelectorSuite Author: Jayant Shekar <jayant@user-MBPMBA-3.local> Closes #6785 from jayantshekhar/SPARK-6723.	2015-10-23 08:45:13 -07:00
Reynold Xin	cdea0174e3	[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package Author: Reynold Xin <rxin@databricks.com> Closes #9239 from rxin/types-private.	2015-10-23 00:00:21 -07:00
Xiangrui Meng	45861693be	[SPARK-10082][MLLIB] minor style updates for matrix indexing after #8271 * `>=0` => `>= 0` * print `i`, `j` in the log message MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #9189 from mengxr/SPARK-10082.	2015-10-20 18:37:29 -07:00
MechCoder	da46b77afd	[SPARK-10082][MLLIB] Validate i, j in apply DenseMatrices and SparseMatrices The given row_ind should be less than the number of rows, and the given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior for such cases. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8271 from MechCoder/hash_code_matrices.	2015-10-20 16:35:34 -07:00
Tijo Thomas	9f49895fef	[SPARK-10261][DOCUMENTATION, ML] Fixed @Since annotation to ml.evaluation Author: Tijo Thomas <tijoparacka@gmail.com> Author: tijo <tijo@ezzoft.com> Closes #8554 from tijoparacka/SPARK-10261-2.	2015-10-20 16:13:34 -07:00
lewuathe	4c33a34ba3	[SPARK-10668] [ML] Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small Author: lewuathe <lewuathe@me.com> Author: Lewuathe <sasaki@treasure-data.com> Author: Kai Sasaki <sasaki@treasure-data.com> Author: Lewuathe <lewuathe@me.com> Closes #8884 from Lewuathe/SPARK-10668.	2015-10-19 10:46:10 -07:00
Luvsandondov Lkhamsuren	cca2258685	[SPARK-9963] [ML] RandomForest cleanup: replace predictNodeIndex with predictImpl predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com> Closes #8609 from lkhamsurenl/SPARK-9963.	2015-10-17 10:07:42 -07:00
Yuhao Yang	e1e77b22b3	[SPARK-11029] [ML] Add computeCost to KMeansModel in spark.ml jira: https://issues.apache.org/jira/browse/SPARK-11029 We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel. This will be a temp fix until we have proper evaluators defined for clustering. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #9073 from hhbyyh/computeCost.	2015-10-17 10:04:19 -07:00
Burak Yavuz	10046ea76c	[SPARK-10599] [MLLIB] Lower communication for block matrix multiplication This PR aims to decrease communication costs in BlockMatrix multiplication in two ways: - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled - Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition NOTE: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was. Initial benchmarking showed promising results (look below), however I did hit some `FileNotFound` exceptions with the new implementation after the shuffle. Size A: 1e5 x 1e5 Size B: 1e5 x 1e5 Block Sizes: 1024 x 1024 Sparsity: 0.01 Old implementation: 1m 13s New implementation: 9s cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster Author: Burak Yavuz <brkyvz@gmail.com> Closes #8757 from brkyvz/opt-bmm.	2015-10-16 15:30:07 -07:00
vectorijk	3889b1c7a9	[SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression Value of the quantile probabilities array should be in the range (0, 1) instead of [0,1] in `AFTSurvivalRegression.scala` according to [Discussion] (https://github.com/apache/spark/pull/8926#discussion-diff-40698242) Author: vectorijk <jiangkai@gmail.com> Closes #9083 from vectorijk/spark-11059.	2015-10-13 15:57:36 -07:00
Xiangrui Meng	2b574f52d7	[SPARK-7402] [ML] JSON SerDe for standard param types This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementation of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9090 from mengxr/SPARK-7402.	2015-10-13 13:24:10 -07:00
Vladimir Vladimirov	c1b4ce4326	[SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com> Closes #8700 from smartkiwi/SPARK-10535_.	2015-10-09 14:16:13 -07:00
Nick Pritchard	5994cfe812	[SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric Compute upper triangular values of the covariance matrix, then copy to lower triangular values. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8940 from pnpritchard/SPARK-10875.	2015-10-08 22:22:20 -07:00
Yanbo Liang	2268356002	[SPARK-7770] [ML] GBT validationTol change to compare with relative or absolute error GBT compare ValidateError with tolerance switching between relative and absolute ones, where the former one is relative to the current loss on the training set. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8549 from yanboliang/spark-7770.	2015-10-08 11:27:46 -07:00
Holden Karau	0903c6489e	[SPARK-9718] [ML] linear regression training summary all columns LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful. Author: Holden Karau <holden@pigscanfly.ca> Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.	2015-10-08 11:16:20 -07:00
Nathan Howell	1bc435ae3a	[SPARK-10064] [ML] Parallelize decision tree bin split calculations Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours. Author: Nathan Howell <nhowell@godaddy.com> Closes #8246 from NathanHowell/SPARK-10064.	2015-10-07 17:46:16 -07:00
DB Tsai	dd36ec6bc5	[SPARK-10738] [ML] Refactoring `Instance` out from LOR and LIR, and also cleaning up some code Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code. Author: DB Tsai <dbt@netflix.com> Closes #8853 from dbtsai/refactoring.	2015-10-07 15:56:57 -07:00
Yanbo Liang	7bf07faa71	[SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS Consolidate the Cholesky solvers in WeightedLeastSquares and ALS. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8936 from yanboliang/spark-10490.	2015-10-07 15:50:45 -07:00
Evan Chen	da936fbb74	[SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) Provide initialModel param for pyspark.mllib.clustering.KMeans Author: Evan Chen <chene@us.ibm.com> Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.	2015-10-07 15:04:53 -07:00
Marcelo Vanzin	94fc57afdf	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8775 from vanzin/SPARK-10300.	2015-10-07 14:11:21 -07:00
Holden Karau	5be5d24744	[SPARK-9841] [ML] Make clear public It is currently impossible to clear Param values once set. It would be helpful to be able to. Author: Holden Karau <holden@pigscanfly.ca> Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.	2015-10-07 12:00:56 -07:00
Yin Huai	b0baa11d3b	[HOT-FIX] Fix style. https://github.com/apache/spark/pull/8882 broke our build. Author: Yin Huai <yhuai@databricks.com> Closes #8964 from yhuai/fixStyle.	2015-10-02 11:23:08 -07:00
Xusen Yin	633aaae0a1	[SPARK-6530] [ML] Add chi-square selector for ml package See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530). Author: Xusen Yin <yinxusen@gmail.com> Closes #5742 from yinxusen/SPARK-6530.	2015-10-02 10:25:58 -07:00
Xusen Yin	23a9448c04	[SPARK-5890] [ML] Add feature discretizer JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890). I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly. Author: Xusen Yin <yinxusen@gmail.com> Closes #5779 from yinxusen/SPARK-5890.	2015-10-02 10:19:18 -07:00
Rerngvit Yanggratoke	2a717821bb	[SPARK-9798] [ML] CrossValidatorModel Documentation Improvements Document CrossValidatorModel members: bestModel and avgMetrics Author: Rerngvit Yanggratoke <rerngvit@kth.se> Closes #8882 from rerngvit/Spark-9798.	2015-10-02 10:15:02 -07:00
Yanbo Liang	2931e89f0c	[SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = "" For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8937 from yanboliang/spark-10736.	2015-09-29 23:58:32 -07:00
y-shimizu	299b439920	[SPARK-10778] [MLLIB] Implement toString for AssociationRules.Rule I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0` Author: y-shimizu <y.shimizu0429@gmail.com> Closes #8904 from y-shimizu/master.	2015-09-27 16:36:03 +01:00
Eric Liang	922338812c	[SPARK-9681] [ML] Support R feature interactions in RFormula This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.	2015-09-25 00:43:22 -07:00
Holden Karau	d91967e159	[SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use simplified dataframe construction As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those. Author: Holden Karau <holden@pigscanfly.ca> Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.	2015-09-23 22:49:08 -07:00
Yanbo Liang	067afb4e9b	[SPARK-10699] [ML] Support checkpointInterval can be disabled Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but they also need a way to disable it. This PR lets users disable checkpointing by setting ```checkpointInterval = -1```. We also add documentation for GBT ```cacheNodeIds``` so that users can understand checkpointing more clearly. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8820 from yanboliang/spark-10699.	2015-09-23 16:41:42 -07:00
Yanbo Liang	ce2b056d35	[SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8836 from yanboliang/spark-10686.	2015-09-23 15:26:02 -07:00
sethah	098be27ad5	[SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility. Author: sethah <seth.hendrickson16@gmail.com> Closes #8675 from sethah/SPARK-9715.	2015-09-23 15:00:52 -07:00
Yanbo Liang	7104ee0e5d	[SPARK-10750] [ML] ML Param validate should print better error information Currently, when you set an illegal value for params of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information. Take ```VectorSlicer.setNames``` as an example: ```scala val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") // The value of setNames must contain distinct elements, so the next line will throw an exception. vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1")) ``` It will throw IllegalArgumentException as: ``` vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5. java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5. ``` We should distinguish the value of array type from primitive type at Param.validate(value: T), and we will get better error information. ``` vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1]. java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1]. ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #8863 from yanboliang/spark-10750.	2015-09-22 11:00:33 -07:00
Holden Karau	f4a3c4e34c	[SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training. Author: Holden Karau <holden@pigscanfly.ca> Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.	2015-09-22 10:19:08 -07:00
Meihua Wu	870b8a2edd	[SPARK-10706] [MLLIB] Add java wrapper for random vector rdd Add java wrapper for random vector rdd holdenk srowen Author: Meihua Wu <meihuawu@umich.edu> Closes #8841 from rotationsymmetry/SPARK-10706.	2015-09-22 11:05:24 +01:00
Feynman Liang	aeef44a3e3	[SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance testing Implementation of significance testing using Streaming API. Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Closes #4716 from feynmanliang/ab_testing.	2015-09-21 13:11:28 -07:00
Meihua Wu	331f0b10f7	[SPARK-9642] [ML] LinearRegression should support weighted data In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting, which accounts for over- or under-sampling. Work in progress. Author: Meihua Wu <meihuawu@umich.edu> Closes #8631 from rotationsymmetry/SPARK-9642.	2015-09-21 12:09:00 -07:00
Holden Karau	20a61dbd9b	[SPARK-10626] [MLLIB] create java friendly method for random rdd SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method. Author: Holden Karau <holden@pigscanfly.ca> Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.	2015-09-21 18:53:28 +01:00

1493 commits