1495 commits

Author SHA1 Message Date
DB Tsai efaf7d1820 [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies.

The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.

Thanks.

## How was this patch tested?

Unit tests

mengxr tedyu holdenk

Author: DB Tsai <dbt@netflix.com>

Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
2016-04-11 09:35:47 -07:00
Zheng RuiFeng 643b4e2257 [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans
## What changes were proposed in this pull request?
Add argument checking for LDA and StreamingKMeans.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12062 from zhengruifeng/initmodel.
2016-04-11 09:33:52 -07:00
Xiangrui Meng 1c751fcf48 [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs
## What changes were proposed in this pull request?

This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.

Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset

## How was this patch tested?

Existing unit tests with some modifications.

cc: rxin jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #12274 from mengxr/SPARK-14500.
2016-04-11 09:28:28 -07:00
fwang1 f4344582ba [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer
## What changes were proposed in this pull request?

Replace sortBy() with top() to calculate the top N frequent words as dictionary.
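
The idea behind preferring `top()` over a full `sortBy()` can be sketched with plain Scala collections standing in for the RDD. This is only an illustration under that assumption: `TopNSketch` and `topN` are hypothetical names, not part of the patch, and the order of terms with equal counts is left unspecified here.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep only the n most frequent (term, count) pairs using a bounded heap,
  // instead of sorting the whole collection.
  def topN(counts: Iterable[(String, Long)], n: Int): Seq[(String, Long)] = {
    // Reversed ordering by count, so the head of the queue is the smallest retained count.
    val minFirst = Ordering.by[(String, Long), Long](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(String, Long)](minFirst)
    counts.foreach { pair =>
      heap.enqueue(pair)
      if (heap.size > n) heap.dequeue() // drop the current smallest
    }
    // Only the n retained pairs are sorted, in descending order of count.
    heap.toSeq.sortBy(p => -p._2)
  }
}
```

Only `n` elements are held and sorted at any time, which is the reason a top-N selection is cheaper than a global sort when the vocabulary is large.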

## How was this patch tested?
Existing unit tests. Terms with the same TF are now sorted in descending order, so a test would fail if it hard-coded the dictionary order of terms with the same TF, like "c", "d"...

Author: fwang1 <desperado.wf@gmail.com>

Closes #12265 from lionelfeng/master.
2016-04-10 01:13:25 -07:00
Xiangrui Meng 415446cc9b Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom"
This reverts commit 1598d11bb0.
2016-04-09 14:03:03 -07:00
DB Tsai 1598d11bb0 [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12241 from dbtsai/dbtsai-mllib-local-build.
2016-04-09 09:21:12 -07:00
wm624@hotmail.com a9b8b655b2 [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param
## What changes were proposed in this pull request?

CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param to CountVectorizer, I moved the binary param to CountVectorizerParams. Therefore, the estimator inherits the binary param.

## How was this patch tested?

Added a new test case, which fits the model with the binary flag set to true and then checks that all non-zero counts in the trained model are set to 1.0.

All tests in CountVectorizerSuite.scala pass.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12200 from wangmiao1981/binary_param.
2016-04-09 09:57:07 +02:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd” argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
Yanbo Liang 56af8e85cc [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint
## What changes were proposed in this pull request?
In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we tell users that they can disable checkpointing by setting ```checkpointInterval = -1```. However, LDA did not actually handle this case, so this PR fixes that bug.
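
As a rough illustration of the intended semantics, the check could look like the sketch below. `CheckpointSketch` and `shouldCheckpoint` are hypothetical names chosen for this example and do not reflect the actual LDA code.

```scala
object CheckpointSketch {
  // Hypothetical helper: decide whether to checkpoint at a given iteration.
  // A checkpointInterval of -1 disables checkpointing entirely.
  def shouldCheckpoint(iteration: Int, checkpointInterval: Int): Boolean = {
    require(checkpointInterval == -1 || checkpointInterval >= 1,
      s"checkpointInterval must be -1 or >= 1 but got $checkpointInterval")
    checkpointInterval != -1 && iteration % checkpointInterval == 0
  }
}
```
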
## How was this patch tested?
Existing tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12089 from yanboliang/spark-14298.
2016-04-08 11:49:44 -07:00
Joseph K. Bradley 953ff897e4 [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer
## What changes were proposed in this pull request?

The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).

This PR adds a "keepLastCheckpoint" option which defaults to true.  This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.

This involves adding the keepLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.

This also:
* Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
* Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
* Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```

## How was this patch tested?

Added 2 new unit tests to spark.ml LDASuite, which call into spark.mllib.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12166 from jkbradley/emlda-save-checkpoint.
2016-04-07 19:48:33 -07:00
Marcelo Vanzin 21d5ca128b [SPARK-14134][CORE] Change the package name used for shading classes.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.

Most changes are just noise to fix the logging configs.

For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11941 from vanzin/SPARK-14134.
2016-04-06 19:33:51 -07:00
sethah bb873754b4 [SPARK-12382][ML] Remove mllib GBT implementation and wrap ml
## What changes were proposed in this pull request?

This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ML.

Primary changes:
* Removed `boost` method in mllib GradientBoostedTrees.scala
* Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib

Other changes:
* Added an `updatePrediction` method in GradientBoostedTrees package. This method is added to provide consistency for methods that build predictions from boosted models. There are several methods that hard code the method of predicting as: sum_{i=1}^{numTrees} (treePrediction*treeWeight). Calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training
* Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part of the `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ML unit tests. They are left in mllib as well so as to not break the API.
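
The prediction rule mentioned above, sum_{i=1}^{numTrees} (treePrediction*treeWeight), can be sketched as follows. The `Tree` type alias and the method signatures are assumptions made for this self-contained example; they are not the signatures used in Spark.

```scala
object GbtPredictionSketch {
  // Hypothetical stand-in for a fitted tree: maps a feature vector to a prediction.
  type Tree = Array[Double] => Double

  // Add one tree's weighted contribution to a running prediction.
  def updatePrediction(
      features: Array[Double],
      prediction: Double,
      tree: Tree,
      weight: Double): Double =
    prediction + tree(features) * weight

  // Ensemble prediction: sum over all trees of treePrediction * treeWeight.
  def predict(features: Array[Double], trees: Seq[Tree], weights: Seq[Double]): Double = {
    require(trees.length == weights.length, "one weight per tree is required")
    trees.zip(weights).foldLeft(0.0) { case (acc, (tree, w)) =>
      updatePrediction(features, acc, tree, w)
    }
  }
}
```

Routing both training-time and test-time predictions through a single function like `updatePrediction` is what keeps the accuracy checks consistent with the algorithm.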

## How was this patch tested?

Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12050 from sethah/SPARK-12382.
2016-04-06 17:13:34 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the following.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Bryan Cutler 9c6556c5f8 [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression
## What changes were proposed in this pull request?

Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.

## How was this patch tested?
Added unit tests to exercise the API calls for the summary classes.  Also manually verified that the values are as expected and match those from Scala directly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
2016-04-06 12:07:47 -07:00
Zheng RuiFeng af73d97378 [SPARK-13538][ML] Add GaussianMixture to ML
JIRA: https://issues.apache.org/jira/browse/SPARK-13538

## What changes were proposed in this pull request?

Add GaussianMixture and GaussianMixtureModel to ML package

## How was this patch tested?

unit tests and manual tests were done.
Local Scalastyle checks passed.

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11419 from zhengruifeng/mlgmm.
2016-04-06 11:45:16 -07:00
Yuhao Yang 8cffcb60de [SPARK-14322][MLLIB] Use treeAggregate instead of reduce in OnlineLDAOptimizer
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14322

OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix.
This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate.
See this line: f12f11e578/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala (L452)
and a few lines below it.
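
The mutation problem can be reproduced with local collections. The sketch below uses a plain `Seq` in place of an RDD and `foldLeft` in place of `aggregate`/`treeAggregate`, which likewise start from a fresh zero value; all names are hypothetical.

```scala
object ReduceVsFoldSketch {
  // Element-wise in-place addition: mutates and returns `acc`.
  private def addInPlace(acc: Array[Double], x: Array[Double]): Array[Double] = {
    var i = 0
    while (i < acc.length) { acc(i) += x(i); i += 1 }
    acc
  }

  // Problematic: reduce passes an input element as the first argument,
  // so one of the input arrays is overwritten with the running sum.
  def sumWithReduce(data: Seq[Array[Double]]): Array[Double] =
    data.reduce((a, b) => addInPlace(a, b))

  // Safer: fold into a fresh zero buffer, so only the accumulator is mutated.
  def sumWithFold(data: Seq[Array[Double]], dim: Int): Array[Double] =
    data.foldLeft(new Array[Double](dim))((acc, x) => addInPlace(acc, x))
}
```
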

## How was this patch tested?
unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12106 from hhbyyh/ldaTreeReduce.
2016-04-06 11:36:26 -07:00
Xusen Yin db0b06c6ea [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13786

Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.

## How was this patch tested?

Test with Python doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12020 from yinxusen/SPARK-13786.
2016-04-06 11:24:11 -07:00
Shally Sangal d356901588 [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes
## What changes were proposed in this pull request?

KMeansSummary class : deprecated size and added clusterSizes

Author: Shally Sangal <shallysangal@gmail.com>

Closes #12084 from shallys/master.
2016-04-05 10:41:59 -07:00
Joseph K. Bradley 8f50574ab4 [SPARK-14386][ML] Changed spark.ml ensemble trees methods to return concrete types
## What changes were proposed in this pull request?

In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types.

This PR returns concrete types.

The MIMA checks appear to be OK with this change.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12158 from jkbradley/hide-dtm.
2016-04-04 20:12:09 -07:00
Joseph K. Bradley 89f3befab6 [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor
## What changes were proposed in this pull request?

**Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)

Modified numTrees method (*deprecation*)
* Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
* What this PR does: Moves method numTrees outside of trait TreeEnsembleModel.  Adds it to GBT and RF Models.  Deprecates it in RF Models in favor of new method getNumTrees.  In Spark 2.1, we can have RF Models include Param numTrees.

Minor items
* Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.

**Implementation details**
* Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
* Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
  * These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
* Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
* Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame.  Same for DefaultParamsReader.loadMetadata

## How was this patch tested?

Adds standard unit tests for RF save/load

Author: Joseph K. Bradley <joseph@databricks.com>
Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #12118 from jkbradley/GayathriMurali-SPARK-13784.
2016-04-04 10:24:02 -07:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fixes over 59 files (+94 lines, -93 lines).
- Fix typos (exception/log strings, testcase names, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New code after SPARK-14011)
- Use diamond operators in 40 lines. (New code after SPARK-13702)
- Fix redundant semicolons in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to convert all Scala-style multiline comments into Java-style multiline comments in Scala code.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Jacek Laskowski 06694f1c68 [MINOR] Typo fixes
## What changes were proposed in this pull request?

Typo fixes. No functional changes.

## How was this patch tested?

Built the sources and ran with samples.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #11802 from jaceklaskowski/typo-fixes.
2016-04-02 08:12:04 -07:00
sethah 4fc35e6f5c [SPARK-14308][ML][MLLIB] Remove unused mllib tree classes and move private classes to ML
## What changes were proposed in this pull request?

Decision tree helper classes will be migrated to ML. This patch moves those internal classes that are not part of the public API and removes ones that are no longer used, after [SPARK-12183](https://github.com/apache/spark/pull/11855). No functional changes are made.

Details:
* Bin.scala is removed as the ML implementation does not require bins
* mllib NodeIdCache is removed. It was only used by the mllib implementation previously, which no longer exists
* mllib TreePoint is removed. It was only used by the mllib implementation previously, which no longer exists
* BaggedPoint, DTStatsAggregator, DecisionTreeMetadata, BaggedPointSuite and TimeTracker are all moved to ML.

## How was this patch tested?

No functional changes are made. Existing unit tests ensure behavior is unchanged.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12097 from sethah/cleanup_mllib_tree.
2016-04-01 21:23:35 -07:00
BenFradet 36e8fb8005 [SPARK-7425][ML] spark.ml Predictor should support other numeric types for label
Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10355 from BenFradet/SPARK-7425.
2016-04-01 18:25:43 -07:00
Cheng Lian 3715ecdf41 [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
2016-04-01 17:02:48 +08:00
Yanbo Liang 22249afb4a [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. This is only a refactoring of the original ```KMeans``` wrapper.

## How was this patch tested?
Existing tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12039 from yanboliang/spark-14059.
2016-03-31 23:49:58 -07:00
Alexander Ulanov 26867ebc67 [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron
1. Implement LossFunction trait and implement squared error and cross entropy
   loss with it
2. Implement unit test for gradient and loss
3. Implement InPlace trait and in-place layer evaluation
4. Refactor interface for ActivationFunction
5. Update of Layer and LayerModel interfaces
6. Fix random weights assignment
7. Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased the flexibility of
the internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>

Closes #9229 from avulanov/mlp-refactoring.
2016-03-31 23:48:36 -07:00
Cheng Lian 1b070637fa [SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM
## What changes were proposed in this pull request?

This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:

```scala
  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options
```

After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.
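
To illustrate the kind of global information involved, the sketch below infers `numFeatures` from raw LibSVM lines and returns it through the options map. It is a simplified stand-in that works on in-memory strings rather than `FileStatus` entries; the object name and the signature are assumptions for this example only.

```scala
object LibSvmNumFeaturesSketch {
  // Parse the 1-based feature indices from one LibSVM line: "label idx:value idx:value ...".
  private def featureIndices(line: String): Seq[Int] =
    line.trim.split("\\s+").drop(1).map(_.split(":")(0).toInt).toSeq

  // Hypothetical stand-in for prepareRead(): if numFeatures is missing,
  // scan all lines once and return the options with numFeatures filled in.
  def prepareRead(options: Map[String, String], lines: Seq[String]): Map[String, String] =
    if (options.contains("numFeatures")) options
    else {
      val maxIndex = lines
        .filter(_.trim.nonEmpty)
        .flatMap(l => featureIndices(l))
        .foldLeft(0)((a, b) => math.max(a, b))
      options + ("numFeatures" -> maxIndex.toString)
    }
}
```

When the option is already set, the input is not scanned at all, which mirrors the default behavior of returning the options untouched.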

An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.

One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.

## How was this patch tested?

Tested using existing test suites.

Author: Cheng Lian <lian@databricks.com>

Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
2016-03-31 23:46:08 -07:00
Xusen Yin 8b207f3b6a [SPARK-11892][ML] Model export/import for spark.ml: OneVsRest
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11892

Add save/load for spark.ml OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.

## How was this patch tested?

Test with Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9934 from yinxusen/SPARK-11892.
2016-03-31 11:17:32 -07:00
Yuhao Yang a0a1991580 [SPARK-13782][ML] Model export/import for spark.ml: BisectingKMeans
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13782
Model export/import for BisectingKMeans in spark.ml and mllib

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11933 from hhbyyh/bisectingsave.
2016-03-31 11:12:40 -07:00
Dongjoon Hyun 208fff3ac8 [SPARK-14164][MLLIB] Improve input layer validation of MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

This PR improves the input layer validation of MultilayerPerceptronClassifier and adds related test cases.

```scala
-    // TODO: how to check ALSO that all elements are greater than 0?
-    ParamValidators.arrayLengthGt(1)
+    (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
```
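
For readers without Spark at hand, the same check can be written as a standalone predicate. This sketch replaces `ParamValidators.gt(0)` with a plain comparison; `LayersValidatorSketch` and `isValidLayers` are hypothetical names.

```scala
object LayersValidatorSketch {
  // Valid when there are at least two layers (input and output)
  // and every layer size is strictly positive.
  val isValidLayers: Array[Int] => Boolean =
    (layers: Array[Int]) => layers.length > 1 && layers.forall(_ > 0)
}
```
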

## How was this patch tested?

Pass the Jenkins tests including the new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11964 from dongjoon-hyun/SPARK-14164.
2016-03-31 09:39:15 -07:00
Yuhao Yang ca458618d8 [SPARK-11507][MLLIB] add compact in Matrices fromBreeze
jira: https://issues.apache.org/jira/browse/SPARK-11507
"In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."

Root cause: colPtr.last does NOT always equal values.length in breeze CSCMatrix, which fails the require in SparseMatrix.

Easy steps to reproduce:
```
import breeze.linalg.{CSCMatrix, Matrix => BM}
import org.apache.spark.mllib.linalg.Matrices

val m1: BM[Double] = new CSCMatrix[Double](Array(1.0, 1, 1), 3, 3, Array(0, 1, 2, 3), Array(0, 1, 2))
val m2: BM[Double] = new CSCMatrix[Double](Array(1.0, 2, 2, 4), 3, 3, Array(0, 0, 2, 4), Array(1, 2, 1, 2))
val sum = m1 + m2
Matrices.fromBreeze(sum)
```

Solution: According to the code in [CSCMatrix](28000a7b90/math/src/main/scala/breeze/linalg/CSCMatrix.scala), a CSCMatrix in breeze can have extra zeros at the end of its data array. Invoking compact ensures that it satisfies the require in SparseMatrix. This should add limited overhead, as the actual compact operation is only performed when necessary.
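
What compaction has to achieve can be shown on a minimal CSC structure. The `Csc` case class below is a hypothetical stand-in, not the breeze or Spark class, and the sketch only trims trailing unused slots so that `values.length == colPtrs.last`.

```scala
object CscCompactSketch {
  // Minimal CSC representation: column pointers, row indices and values.
  final case class Csc(
      numRows: Int,
      numCols: Int,
      colPtrs: Array[Int],
      rowIndices: Array[Int],
      values: Array[Double])

  // Trim trailing unused slots so that values.length == colPtrs.last.
  def compact(m: Csc): Csc = {
    val used = m.colPtrs.last
    if (m.values.length == used && m.rowIndices.length == used) m // already compact, no copy
    else m.copy(rowIndices = m.rowIndices.take(used), values = m.values.take(used))
  }
}
```

The early return reflects the point about overhead: arrays are copied only when the matrix actually carries extra slots.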

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9520 from hhbyyh/matricesFromBreeze.
2016-03-30 15:58:19 -07:00
Yanbo Liang 5dc948e812 [MINOR][ML] Fix the wrong param name of LDA topicDistributionCol
## What changes were proposed in this pull request?
Fix the wrong param name of LDA ```topicDistributionCol```.
## How was this patch tested?
No tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12065 from yanboliang/lda-topicDistributionCol.
2016-03-30 14:57:38 -07:00
Xusen Yin 529d6ce8f9 [SPARK-14181] TrainValidationSplit should have HasSeed
https://issues.apache.org/jira/browse/SPARK-14181

TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11985 from yinxusen/SPARK-14181.
2016-03-30 14:32:29 -07:00
Yuhao Yang d2a819a636 [SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov test
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14154

After reading the code for KolmogorovSmirnovTest, I found that it can be simplified considerably by following the original definition.

Sending this PR for discussion.

## How was this patch tested?
unit test

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11954 from hhbyyh/ksoptimize.
2016-03-29 09:16:50 -07:00
Bryan Cutler 425bcf6d68 [SPARK-13963][ML] Adding binary toggle param to HashingTF
## What changes were proposed in this pull request?
Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality.  This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models.
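
The effect of the toggle can be sketched without the hashing step. In the example below, terms are counted by their string value rather than by hashed index, which is a simplification; `BinaryTermCountsSketch` and `termFrequencies` are hypothetical names.

```scala
object BinaryTermCountsSketch {
  // Count term occurrences; with binary = true, every non-zero count becomes 1.0.
  def termFrequencies(terms: Seq[String], binary: Boolean): Map[String, Double] =
    terms.groupBy(identity).map { case (term, occurrences) =>
      term -> (if (binary) 1.0 else occurrences.size.toDouble)
    }
}
```
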

## How was this patch tested?
Added unit tests for ML and MLlib

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
2016-03-29 12:30:30 +02:00
sethah f6066b0c3c [SPARK-11730][ML] Add feature importances for GBTs.
## What changes were proposed in this pull request?

Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each.

GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).
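
A minimal sketch of the averaging step, assuming each tree already exposes an importance vector of the same length. The names are hypothetical, and any normalization or tree weighting that the actual implementation may apply is deliberately left out.

```scala
object FeatureImportanceSketch {
  // Average per-tree importance vectors element-wise across the ensemble.
  def averageImportances(perTree: Seq[Array[Double]]): Array[Double] = {
    require(perTree.nonEmpty, "at least one tree is required")
    val numFeatures = perTree.head.length
    require(perTree.forall(_.length == numFeatures), "all trees must cover the same features")
    Array.tabulate(numFeatures)(j => perTree.map(_(j)).sum / perTree.length)
  }
}
```
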

## How was this patch tested?

Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11961 from sethah/SPARK-11730.
2016-03-28 22:27:53 -07:00
Xusen Yin 8c11d1aab8 [SPARK-11893] Model export/import for spark.ml: TrainValidationSplit
https://issues.apache.org/jira/browse/SPARK-11893

jkbradley In order to share read/write with `TrainValidationSplit`, I moved the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tuning package.

To reduce the repeated tests, I moved the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and created a fake validator called `MyValidator` to test the shared code.

With `SharedReadWrite`, any newly added `Validator` can share the common read/write part and only needs to implement save/load for its extra params.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9971 from yinxusen/SPARK-11893.
2016-03-28 15:40:06 -07:00
Chenliang Xu c8388297c4 [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrix
## What changes were proposed in this pull request?

Fix incorrect use of binarySearch in SparseMatrix

## How was this patch tested?

Unit test added.

Author: Chenliang Xu <chexu@groupon.com>

Closes #11992 from luckyrandom/SPARK-14187.
2016-03-28 08:33:37 -07:00
Sean Owen 7b84154018 [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode
## What changes were proposed in this pull request?

Better error message when k-means initialization cannot take enough samples from the input (perhaps because it is empty).

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11979 from srowen/SPARK-12494.
2016-03-28 12:01:33 +01:00
Joseph K. Bradley 8ef493760f [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel evaluate() public
## What changes were proposed in this pull request?

Made evaluate method public.  Fixed LogisticRegressionModel evaluate to handle case when probabilityCol is not specified.

## How was this patch tested?

There were already unit tests for these methods.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11928 from jkbradley/public-evaluate.
2016-03-27 19:04:18 -07:00
Dongjoon Hyun 0f02a5c6e6 [MINOR][MLLIB] Remove TODO comment in DecisionTreeModel.scala
## What changes were proposed in this pull request?

This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) was fixed. We should therefore remove the comment without changing the persistence code.

```scala
-        categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
+        categories: Seq[Double]) {
```

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11966 from dongjoon-hyun/change_categories_type.
2016-03-27 20:07:31 +01:00
Liwei Lin 62a85eb09f [SPARK-14089][CORE][MLLIB] Remove methods that have been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5
## What changes were proposed in this pull request?

Removed methods that have been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.

## How was this patch tested?

- manually checked that no code in Spark calls these methods any more
- existing test suites

Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11910 from lw-lin/remove-deprecates.
2016-03-26 12:41:34 +00:00
Joseph K. Bradley 54d13bed87 [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula
## What changes were proposed in this pull request?

StringIndexerModel.transform sets the output column metadata to use name inputCol.  It should not.  Fixing this causes a problem with the metadata produced by RFormula.

Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction.

Note that "prefixes" is no longer accurate since internal strings may be replaced.

## How was this patch tested?

Unit test which failed before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11965 from jkbradley/StringIndexer-fix.
2016-03-25 16:00:09 -07:00
Yanbo Liang 13cbb2de70 [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR
## What changes were proposed in this pull request?
This PR continues the work in #11447. It implements a wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close #11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11932 from yanboliang/spark-13010-new.
2016-03-24 22:29:34 -07:00
Xusen Yin 2cf46d5a96 [SPARK-11871] Add save/load for MLPC
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11871

Add save/load for MLPC

## How was this patch tested?

Test with Scala unit test

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9854 from yinxusen/SPARK-11871.
2016-03-24 15:29:17 -07:00
Ruifeng Zheng 048a7594e2 [SPARK-14030][MLLIB] Add parameter check to MLLIB
## What changes were proposed in this pull request?

Add parameter verification to MLlib, such as:
numCorrections > 0
tolerance >= 0
iters > 0
regParam >= 0
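
A minimal sketch of how such checks can be expressed with `require`. The `LbfgsSettings` class is hypothetical and is not part of MLlib; only the parameter names and bounds are taken from the list above.

```scala
object ParamCheckSketch {
  // Hypothetical settings holder (not an MLlib class). Each `require`
  // throws IllegalArgumentException when its condition does not hold.
  final case class LbfgsSettings(
      numCorrections: Int,
      tolerance: Double,
      iters: Int,
      regParam: Double) {
    require(numCorrections > 0, s"numCorrections must be positive but got $numCorrections")
    require(tolerance >= 0, s"tolerance must be non-negative but got $tolerance")
    require(iters > 0, s"iters must be positive but got $iters")
    require(regParam >= 0, s"regParam must be non-negative but got $regParam")
  }
}
```

Failing at construction time surfaces a bad argument immediately instead of partway through an optimization run.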

## How was this patch tested?

manual tests

Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <mllabs@datanode1.(none)>
Author: mllabs <mllabs@datanode1.(none)>
Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11852 from zhengruifeng/lbfgs_check.
2016-03-24 09:25:00 +00:00
Juarez Bochi 1803bf6333 Fix typo in ALS.scala
## What changes were proposed in this pull request?

Just a typo

## How was this patch tested?

N/A

Author: Juarez Bochi <jbochi@gmail.com>

Closes #11896 from jbochi/patch-1.
2016-03-24 09:24:00 +00:00
Joseph K. Bradley cf823bead1 [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.
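
The relaxed memory rule for node selection can be illustrated roughly as follows. This is a simplified sketch and not the code of `selectNodesToSplit()`: the function name, the flat list of per-node memory estimates, and the return type are all assumptions made for illustration.

```scala
object NodeSelectionSketch {
  // Returns the indices of the nodes chosen for splitting in this pass.
  // The first node is always accepted, even if it alone exceeds the budget;
  // later nodes are accepted only while the running total stays within it.
  def selectNodes(nodeMemoryBytes: Seq[Long], maxMemoryBytes: Long): Seq[Int] = {
    val selected = scala.collection.mutable.ArrayBuffer.empty[Int]
    var used = 0L
    var done = false
    val it = nodeMemoryBytes.iterator.zipWithIndex
    while (it.hasNext && !done) {
      val (mem, idx) = it.next()
      if (selected.isEmpty || used + mem <= maxMemoryBytes) {
        selected += idx
        used += mem
      } else {
        done = true
      }
    }
    selected.toList
  }
}
```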

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
2016-03-23 21:16:00 -07:00
sethah 69bc2c17f1 [SPARK-13952][ML] Add random seed to GBT
## What changes were proposed in this pull request?

`GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily.
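
As a rough illustration of why a fixed seed gives reproducible results, the sketch below subsamples row indices with `scala.util.Random`. The `subsample` helper is hypothetical and is not taken from the GBT code.

```scala
object SeedSketch {
  // Keeps each index in [0, n) with probability `rate`.
  // The same seed always yields the same selection.
  def subsample(n: Int, rate: Double, seed: Long): Seq[Int] = {
    val rng = new scala.util.Random(seed)
    (0 until n).filter(_ => rng.nextDouble() < rate)
  }
}
```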

## How was this patch tested?

Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11903 from sethah/SPARK-13952.
2016-03-23 15:08:47 -07:00
Joseph K. Bradley 4d955cd694 [SPARK-14035][MLLIB] Make error message more verbose for mllib NaiveBayesSuite
## What changes were proposed in this pull request?

Print more info about failed NaiveBayesSuite tests which have exhibited flakiness.

## How was this patch tested?

Ran locally with incorrect check to cause failure.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11858 from jkbradley/naive-bayes-bug-log.
2016-03-23 10:51:58 +00:00
Xusen Yin d6dc12ef01 [SPARK-13449] Naive Bayes wrapper in SparkR
## What changes were proposed in this pull request?

This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.

I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

I removed the preprocessing step that omits NA values because we don't know which columns to process.

## How was this patch tested?

Test against output from R package e1071's naiveBayes.

cc: yanboliang yinxusen

Closes #11486

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #11890 from mengxr/SPARK-13449.
2016-03-22 14:16:51 -07:00
Dongjoon Hyun df61fbd978 [SPARK-13986][CORE][MLLIB] Remove DeveloperApi-annotations for non-publics
## What changes were proposed in this pull request?

Spark uses the `DeveloperApi` annotation, but sometimes it seems to conflict with visibility. This PR tries to fix those conflicts by removing the annotation from non-public definitions. The following is an example.

**JobResult.scala**
```scala
@DeveloperApi
sealed trait JobResult

@DeveloperApi
case object JobSucceeded extends JobResult

-@DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
```

## How was this patch tested?

Pass the existing Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11797 from dongjoon-hyun/SPARK-13986.
2016-03-21 14:57:52 +00:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

The [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has a 100-character limit on lines, but the check has been disabled for Java since 11/09/15. This PR enables the **LineLength** checkstyle rule again. To support that, it also introduces **RedundantImport** and **RedundantModifier**. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
@@ -167,5 +164,7 @@
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins, so it needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should pass locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
sethah 811a524722 [SPARK-12182][ML] Distributed binning for trees in spark.ml
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10231 from sethah/SPARK-12182.
2016-03-20 12:31:28 -07:00
Yuhao Yang f43a26ef92 [SPARK-13629][ML] Add binary toggle Param to CountVectorizer
## What changes were proposed in this pull request?

This is follow-up work for https://github.com/apache/spark/pull/11536#issuecomment-198511013,
containing some comment updates and style adjustments.
jkbradley

## How was this patch tested?

unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11830 from hhbyyh/cvToggle.
2016-03-18 17:34:33 -07:00
Yanbo Liang 7783b6f38f [MINOR][ML] When trainingSummary is None, it should throw RuntimeException.
## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11784 from yanboliang/fix-summary.
2016-03-18 11:23:17 +00:00
sethah 1614485fd9 [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

Say there are 3 categories A, B, C. We consider 3 splits:

* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).

This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.
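
The subtraction described above can be sketched as follows. This is a simplified illustration, assuming the sufficient statistics are plain arrays of equal length; it is not the `DTStatsAggregator` code.

```scala
object ParentStatsSketch {
  // Element-wise difference: stats(right) = stats(parent) - stats(left).
  def rightChildStats(parent: Array[Double], left: Array[Double]): Array[Double] = {
    require(parent.length == left.length, "stats arrays must have the same length")
    parent.zip(left).map { case (p, l) => p - l }
  }
}
```

With parent stats for {A, B, C} and left stats for {A}, the result is the stats for {B, C}, so only the left-hand subsets need to be collected and communicated.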

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9474 from sethah/SPARK-10788.
2016-03-17 16:44:41 -07:00
Joseph K. Bradley b39e80d39d [SPARK-13761][ML] Remove remaining uses of validateParams
## What changes were proposed in this pull request?

Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema

## How was this patch tested?

Existing unit tests, modified to check using transformSchema instead of validateParams

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11790 from jkbradley/SPARK-13761-cleanup.
2016-03-17 13:23:07 -07:00
Xusen Yin edf8b8775b [SPARK-11891] Model export/import for RFormula and RFormulaModel
https://issues.apache.org/jira/browse/SPARK-11891

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9884 from yinxusen/SPARK-11891.
2016-03-17 10:19:10 -07:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Yuhao Yang 357d82d84d [SPARK-13629][ML] Add binary toggle Param to CountVectorizer
## What changes were proposed in this pull request?

It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
If set, then all non-zero counts will be set to 1.
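
A minimal sketch of the intended behavior, using plain Scala collections rather than the `CountVectorizer` API. The `termCounts` helper and its signature are assumptions made for illustration.

```scala
object BinaryCountSketch {
  // Counts how often each vocabulary term occurs in `tokens`.
  // With binary = true, every non-zero count is clamped to 1.0.
  def termCounts(tokens: Seq[String], vocabulary: Seq[String], binary: Boolean): Seq[Double] = {
    val counts = tokens.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    vocabulary.map { term =>
      val c = counts.getOrElse(term, 0.0)
      if (binary && c > 0) 1.0 else c
    }
  }
}
```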

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11536 from hhbyyh/cvToggle.
2016-03-17 11:21:11 +02:00
Yuhao Yang 92b70576ea [SPARK-13761][ML] Deprecate validateParams
## What changes were proposed in this pull request?

Deprecate validateParams() method here: 035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)
Move all functionality in overridden methods to transformSchema().
Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11620 from hhbyyh/depreValid.
2016-03-16 17:31:55 -07:00
Jakob Odersky d4d84936fb [SPARK-11011][SQL] Narrow type of UDT serialization
## What changes were proposed in this pull request?

Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
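
The idea can be sketched with a simplified stand-in for a user defined type. `SimpleUDT`, `Point`, and `PointUDT` are hypothetical names and do not mirror Spark's `UserDefinedType` class; the point is only that `serialize` takes the user type `T` rather than `Any`.

```scala
object UdtSketch {
  // Hypothetical stand-in: `serialize` accepts T instead of Any, so passing
  // a value of the wrong type is rejected at compile time.
  abstract class SimpleUDT[T] {
    def serialize(obj: T): Any
    def deserialize(datum: Any): T
  }

  final case class Point(x: Double, y: Double)

  object PointUDT extends SimpleUDT[Point] {
    def serialize(p: Point): Any = Seq(p.x, p.y)

    def deserialize(datum: Any): Point = datum match {
      case Seq(x: Double, y: Double) => Point(x, y)
      case other => throw new IllegalArgumentException(s"Cannot deserialize $other")
    }
  }
}
```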

## How was this patch tested?

Existing tests were successfully run on local machine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #11379 from jodersky/SPARK-11011-udt-types.
2016-03-16 16:59:36 -07:00
Xiangrui Meng 85c42fda99 [SPARK-13927][MLLIB] add row/column iterator to local matrices
## What changes were proposed in this pull request?

Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
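
A rough sketch of what row and column iteration looks like for a dense matrix stored in column-major order. `DenseMat` is a hypothetical class, not the MLlib matrix type, and the sparse case is not covered here.

```scala
object ColIterSketch {
  // values(i + j * numRows) holds the entry at row i, column j.
  final case class DenseMat(numRows: Int, numCols: Int, values: Array[Double]) {
    require(values.length == numRows * numCols, "values must have numRows * numCols entries")

    // Each column is a contiguous slice of the backing array.
    def colIter: Iterator[Array[Double]] =
      Iterator.tabulate(numCols)(j => values.slice(j * numRows, (j + 1) * numRows))

    // Each row is gathered with a stride of numRows.
    def rowIter: Iterator[Array[Double]] =
      Iterator.tabulate(numRows)(i => Array.tabulate(numCols)(j => values(i + j * numRows)))
  }
}
```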

## How was this patch tested?

Unit tests on sparse and dense matrix.

cc: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #11757 from mengxr/SPARK-13927.
2016-03-16 14:19:54 -07:00
Joseph K. Bradley 6fc2b6541f [SPARK-11888][ML] Decision tree persistence in spark.ml
### What changes were proposed in this pull request?

Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
* The shared implementation is in treeModels.scala
* I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.

Other changes:
* Made CategoricalSplit.numCategories public (to use in persistence)
* Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.

### How was this patch tested?

Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11581 from jkbradley/dt-io.
2016-03-16 14:18:35 -07:00
Yanbo Liang 3f06eb72ca [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format
## What changes were proposed in this pull request?
Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy against R's glm, glmnet and survival packages.
cc mengxr
## How was this patch tested?
The test suite is ignored, but I have enabled all these cases offline and it works as expected.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11463 from yanboliang/spark-13613.
2016-03-16 14:14:15 -07:00
Cheng Hao d9670f8473 [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.

## How was this patch tested?
No additional unit test required.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #11730 from chenghao-intel/range.
2016-03-16 11:20:15 -07:00
Sean Owen 3b461d9ecd [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up
## What changes were proposed in this pull request?

Follow up to https://github.com/apache/spark/pull/11657

- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11725 from srowen/SPARK-13823.2.
2016-03-16 09:36:34 +00:00
Yanbo Liang 3665294d4e [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively reweighted least squares
## What changes were proposed in this pull request?
Provide R-like summary statistics for GLMs via iteratively reweighted least squares.
## How was this patch tested?
unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11694 from yanboliang/spark-9837.
2016-03-15 22:30:07 -07:00
sethah dafd70fbfe [SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml
Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation.
Performance testing should be done to ensure there were no regressions.

Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10607 from sethah/SPARK-12379.
2016-03-15 11:50:34 +02:00
Michael Armbrust 17eec0a71b [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.

Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
 - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
 - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
 - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
 - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
 - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.
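
To give a feel for the last point, here is a generic greedy bin-packing sketch. It is not the algorithm used by `FileSourceStrategy`; the `pack` function, the size-only file representation, and the `maxBinBytes` parameter are assumptions made for illustration.

```scala
object BinPackSketch {
  // Sorts files by size (largest first) and fills one bin at a time.
  // A new bin is started whenever the next file would overflow the current one.
  def pack(fileSizes: Seq[Long], maxBinBytes: Long): Seq[Seq[Long]] = {
    require(maxBinBytes > 0, "maxBinBytes must be positive")
    val bins = scala.collection.mutable.ArrayBuffer.empty[Vector[Long]]
    var current = Vector.empty[Long]
    var currentBytes = 0L
    for (size <- fileSizes.sorted(Ordering[Long].reverse)) {
      if (current.nonEmpty && currentBytes + size > maxBinBytes) {
        bins += current
        current = Vector.empty[Long]
        currentBytes = 0L
      }
      current = current :+ size
      currentBytes += size
    }
    if (current.nonEmpty) bins += current
    bins.toList
  }
}
```

Each resulting bin would correspond to one task, so many small files end up sharing a task instead of each getting its own.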

Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.

A stub for `FileScanRDD` is also added, but most methods remain unimplemented.

Other minor cleanups:
 - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
 - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
 - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
 - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.

Author: Michael Armbrust <michael@databricks.com>

Closes #11646 from marmbrus/fileStrategy.
2016-03-14 19:21:12 -07:00
Ehsan M.Kermani 992142b87e [SPARK-11826][MLLIB] Refactor add() and subtract() methods
srowen Could you please check this when you have time?

Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>

Closes #9916 from ehsanmok/JIRA-11826.
2016-03-14 19:17:09 -07:00
Dongjoon Hyun a48296f4fe [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter regParam to (Streaming)LinearRegressionWithSGD
## What changes were proposed in this pull request?

`LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` as a constructor argument. They just depend on GradientDescent's default `regParam` value.
To be consistent with other algorithms, we should add it. The same default value is used.

## How was this patch tested?

Pass the existing unit test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11527 from dongjoon-hyun/SPARK-13686.
2016-03-14 12:46:53 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in test case names
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
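
A small sketch of the pattern, written in Scala against the standard Java API. The `CharsetSketch` object and its helper names are illustrative only.

```scala
import java.nio.charset.StandardCharsets

object CharsetSketch {
  // Passing the Charset explicitly avoids depending on the platform default
  // encoding and, unlike the String-named overload, declares no checked
  // UnsupportedEncodingException.
  def encode(s: String): Array[Byte] = s.getBytes(StandardCharsets.UTF_8)

  def decode(bytes: Array[Byte]): String = new String(bytes, StandardCharsets.UTF_8)
}
```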

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Dongjoon Hyun db88d0204e [MINOR][DOCS] Replace DataFrame with Dataset in Javadoc.
## What changes were proposed in this pull request?

SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.

* http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
* http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
2016-03-13 12:11:18 +08:00
Cheng Lian c079420d7c [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
## What changes were proposed in this pull request?

This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
2016-03-13 12:02:52 +08:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but the Java API is broken because Java knows nothing about Scala type aliases (the fix is mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `Dataset.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from the Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    This has the same problem as above, but it hasn't been addressed for the Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have signatures that conflict with the original `Dataset.groupBy` methods.  To distinguish the two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always does eager analysis now

    We used to support disabling DataFrame eager analysis to help report a partially analyzed malformed logical plan on analysis failure.  However, Dataset encoders require eager analysis during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Dongjoon Hyun 91fed8e9c5 [SPARK-3854][BUILD] Scala style: require spaces before {.
## What changes were proposed in this pull request?

Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that rejects the '){' pattern in favor of the majority style shown below, and fixes the code accordingly. Enforcing this in ScalaStyle from now on will improve Scala code quality and reduce review time.
```
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
   println("Wow!")
}
```
IntelliJ also shows new warnings based on this.

## How was this patch tested?

Pass the Jenkins ScalaStyle test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11637 from dongjoon-hyun/SPARK-3854.
2016-03-10 15:57:22 -08:00
sethah 9fe38aba1f [SPARK-11108][ML] OneHotEncoder should support other numeric types
Adding support for other numeric types:

* Integer
* Short
* Long
* Float
* Decimal

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9777 from sethah/SPARK-11108.
2016-03-10 13:17:41 +02:00
sethah e1772d3f19 [SPARK-11861][ML] Add feature importances for decision trees
This patch adds an API entry point for single decision tree feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9912 from sethah/SPARK-11861.
2016-03-09 14:44:51 -08:00
Yanbo Liang 0dd06485c4 [SPARK-13615][ML] GeneralizedLinearRegression supports save/load
## What changes were proposed in this pull request?
```GeneralizedLinearRegression``` supports ```save/load```.
cc mengxr
## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11465 from yanboliang/spark-13615.
2016-03-09 11:59:22 -08:00
Dongjoon Hyun c3689bc24e [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
## What changes were proposed in this pull request?

In order to make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator.

```
-    final ArrayList<Product2<Object, Object>> dataToWrite =
-      new ArrayList<Product2<Object, Object>>();
+    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```

Java 7 or higher supports the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code uses it inconsistently.

## How was this patch tested?

Manual.
Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11541 from dongjoon-hyun/SPARK-13702.
2016-03-09 10:31:26 +00:00
Yanbo Liang 9740954f3f [ML] testEstimatorAndModelReadWrite should call checkModelData
## What changes were proposed in this pull request?
Although we define ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never calls ```checkModelData``` to check the equality of model data. As a result, the model data equality check currently does not run for any test case; this PR fixes that.
This PR also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. This bug should have failed the test, but it went unnoticed because ```checkModelData``` was never run.
cc jkbradley mengxr
## How was this patch tested?
No new unit tests; the existing ones should pass.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11513 from yanboliang/ml-check-model-data.
2016-03-08 13:27:31 -08:00
Sean Owen 54040f8d35 [SPARK-13715][MLLIB] Remove last usages of jblas in tests
## What changes were proposed in this pull request?

Remove last usage of jblas, in tests

## How was this patch tested?

Jenkins tests -- the same ones that are being modified.

Author: Sean Owen <sowen@cloudera.com>

Closes #11560 from srowen/SPARK-13715.
2016-03-08 17:47:55 +00:00
Michael Armbrust e720dda42e [SPARK-13665][SQL] Separate the concerns of HadoopFsRelation
`HadoopFsRelation` is used for reading most files into Spark SQL.  However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.  As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency.  This PR is a first cut at separating this into several components / interfaces that are each described below.  Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`.  External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.

### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource.  All discovery, resolution and merging logic for schemas and partitions has been removed.  This is an internal representation that no longer needs to be exposed to developers.

```scala
case class HadoopFsRelation(
    sqlContext: SQLContext,
    location: FileCatalog,
    partitionSchema: StructType,
    dataSchema: StructType,
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String]) extends BaseRelation
```

### FileFormat
The primary interface that will be implemented by each different format including external libraries.  Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`.  A format can optionally return a schema that is inferred from a set of files.

```scala
trait FileFormat {
  def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory

  def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow]
}
```

The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner).  Additionally, scans are still returning `RDD`s instead of iterators for single files.  In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.

### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.

```scala
trait FileCatalog {
  def paths: Seq[Path]
  def partitionSpec(schema: Option[StructType]): PartitionSpec
  def allFiles(): Seq[FileStatus]
  def getStatus(path: Path): Array[FileStatus]
  def refresh(): Unit
}
```

Currently there are two implementations:
 - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`.  Infers partitioning by recursive listing and caches this data for performance
 - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
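
To make "directory based partitioning" concrete, here is a minimal sketch of reading partition values out of `key=value` path segments. It is an illustration only: `PartitionPathSketch` and `parsePartitionValues` are hypothetical names, not part of this PR, and the real discovery code also has to deal with type inference, escaping and conflicting directory layouts.

```scala
object PartitionPathSketch {
  // Hypothetical helper, not Spark API: collects the `key=value` directory
  // segments of a path, in the order in which they appear.
  def parsePartitionValues(path: String): Seq[(String, String)] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _ => None // plain directory or file name, not a partition segment
      }
    }
}
```

For a path such as `/data/year=2016/month=03/part-0.parquet`, the helper yields the pairs `year -> 2016` and `month -> 03`; values stay as strings in this sketch.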

### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
 - `paths: Seq[String] = Nil`
 - `userSpecifiedSchema: Option[StructType] = None`
 - `partitionColumns: Array[String] = Array.empty`
 - `bucketSpec: Option[BucketSpec] = None`
 - `provider: String`
 - `options: Map[String, String]`

This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones).  All reconciliation of partitions, buckets, schema from metastores or inference is done here.

### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
 - pruning the files from partitions that will be read based on filters.
 - appending partition columns*
 - applying additional filters when a data source can not evaluate them internally.
 - constructing an RDD that is bucketed correctly when required*
 - sanity checking schema match-up and other analysis when writing.

*In the future we should do the following:
 - Break out file handling into its own Strategy as it's sufficiently complex / isolated.
 - Push the appending of partition columns down into `FileFormat` to avoid an extra copy / unvectorization.
 - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`

Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #11509 from marmbrus/fileDataSource.
2016-03-07 15:15:10 -08:00
Xusen Yin 83302c3bff [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.

In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11203 from yinxusen/SPARK-13036.
2016-03-04 08:32:24 -08:00
Abou Haydar Elias 27e88faa05 [SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…
## What changes were proposed in this pull request?

It avoids counting the dataframe twice.

Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
Author: Elie A <abouhaydar.elias@gmail.com>

Closes #11491 from eliasah/quantile-discretizer-patch.
2016-03-04 10:01:52 +00:00
Dongjoon Hyun 941b270b70 [MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?

This PR fixes typos in comments and testcase name of code.

## How was this patch tested?

manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
2016-03-03 22:42:12 +00:00
Yanbo Liang ce58e99aae [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam
## What changes were proposed in this pull request?
Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
cc mengxr srowen
## How was this patch tested?
Documents change, no test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11344 from yanboliang/shared-cleanup.
2016-03-03 13:36:54 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains
- find + nonEmpty -> exists
- filter + size -> count
- reduce(_+_) -> sum
- map + flatten -> flatMap

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
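
For readers unfamiliar with these rewrites, a small sketch of the preferred forms follows. It uses only the Scala standard library; `CollectionIdiomsSketch` and its values are made up for illustration and do not come from the patch.

```scala
object CollectionIdiomsSketch {
  val xs: Seq[Int] = Seq(3, 1, 4, 1, 5)
  val settings: Map[String, Int] = Map("a" -> 1)

  // exists(_ == x) -> contains(x)
  val hasFour: Boolean = xs.contains(4)
  // find(p).nonEmpty -> exists(p)
  val anyEven: Boolean = xs.exists(_ % 2 == 0)
  // filter(p).size -> count(p)
  val ones: Int = xs.count(_ == 1)
  // reduce(_ + _) -> sum
  val total: Int = xs.sum
  // map(f).flatten -> flatMap(f)
  val doubled: Seq[Int] = xs.flatMap(x => Seq(x, x))
  // get(k) followed by getOrElse(d) -> getOrElse(k, d)
  val missing: Int = settings.getOrElse("b", 0)
  // String/Array .size -> .length, which needs no implicit wrapper
  val nameLength: Int = "spark".length

  // Overflow in compareTo: subtraction can wrap around, Integer.compare cannot.
  def compareBySubtraction(a: Int, b: Int): Int = a - b
  def compareSafely(a: Int, b: Int): Int = Integer.compare(a, b)
}
```

With `Int.MinValue` and `1` as arguments, the subtraction-based comparison wraps around to a positive number and reports the wrong order, while `Integer.compare` returns a negative result.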

## How was this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Yanbo Liang 5ed48dd84d [SPARK-12811][ML] Estimator for Generalized Linear Models(GLMs)
Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11136 from yanboliang/spark-12811.
2016-03-01 08:47:56 -08:00
Zheng RuiFeng ac5c635281 [SPARK-13506][MLLIB] Fix the wrong parameter in R code comment in AssociationRulesSuite
JIRA: https://issues.apache.org/jira/browse/SPARK-13506

## What changes were proposed in this pull request?

Just change the R snippet comment in AssociationRulesSuite.

## How was this patch tested?

Unit tests passed.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11387 from zhengruifeng/ars.
2016-02-29 14:51:27 +00:00
Yanbo Liang d81a71357e [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python
## What changes were proposed in this pull request?
* The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
* BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route.
* Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly.

cc mengxr dbtsai
## How was this patch tested?
No new tests, it should pass all current tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11424 from yanboliang/spark-13545.
2016-02-29 00:55:51 -08:00
Bryan Cutler b33261f913 [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the tree module.

closes #10601

Author: Bryan Cutler <cutlerb@gmail.com>
Author: vijaykiran <mail@vijaykiran.com>

Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
2016-02-26 08:30:32 -08:00
Cheng Lian 99dfcedbfd [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This is another try of PR #11323.

This PR removes DataFrame RDD operations except for `foreach` and `foreachPartition` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.

PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartition` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs, but these wrappers were removed in #11323.

## How was this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11388 from liancheng/remove-df-rdd-ops.
2016-02-27 00:28:30 +08:00
Yuhao Yang 90d07154c2 [SPARK-13028] [ML] Add MaxAbsScaler to ML.feature as a transformer
jira: https://issues.apache.org/jira/browse/SPARK-13028
MaxAbsScaler works in a very similar way to MinMaxScaler, but scales so that the training data lies within the range [-1, 1] by dividing each feature by its largest absolute value. The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.

Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
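
A rough sketch of the scaling described above, on plain arrays rather than DataFrames. This is not the transformer added by this PR; `MaxAbsScalerSketch`, `fit` and `transform` are illustrative names, and the handling of all-zero features (left unchanged) is an assumption of the sketch.

```scala
object MaxAbsScalerSketch {
  // Largest absolute value of each feature (column) over all rows.
  def fit(rows: Seq[Array[Double]]): Array[Double] = {
    val numFeatures = rows.head.length
    Array.tabulate(numFeatures)(j => rows.map(r => math.abs(r(j))).max)
  }

  // Divide every value by its feature's maximum absolute value.
  // There is no shifting, so zero entries stay zero and sparsity is kept.
  def transform(rows: Seq[Array[Double]], maxAbs: Array[Double]): Seq[Array[Double]] =
    rows.map { row =>
      row.zip(maxAbs).map { case (v, m) => if (m == 0.0) v else v / m }
    }
}
```

Every scaled value ends up in [-1, 1], and a feature that is zero everywhere is passed through untouched instead of causing a division by zero.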

Something similar from sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10939 from hhbyyh/maxabs and squashes the following commits:

fd8bdcd [Yuhao Yang] add tag and some optimization on fit
648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
cb10bb6 [Yuhao Yang] remove minmax
91ef8f3 [Yuhao Yang] ut added
8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
a9215b5 [Yuhao Yang] max abs scaler
2016-02-25 21:04:35 -08:00
Yu ISHIKAWA 14e2700de2 [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication
## What changes were proposed in this pull request?
ML StringIndexer does not protect itself from column name duplication.

We should still improve how the schemas of `StringIndexer` and `StringIndexerModel` are validated.  However, that is better fixed in a separate issue.

## How was this patch tested?
unit test

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11370 from yu-iskw/SPARK-12874.
2016-02-25 13:21:33 -08:00
Davies Liu 751724b132 Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations"
This reverts commit 157fe64f3e.
2016-02-25 11:53:48 -08:00
Cheng Lian 157fe64f3e [SPARK-13457][SQL] Removes DataFrame RDD operations
## What changes were proposed in this pull request?

This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.

## How was this patch tested?

No extra tests are added. Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11323 from liancheng/remove-df-rdd-ops.
2016-02-25 23:07:59 +08:00
Yanbo Liang 4460113d41 [SPARK-13490][ML] ML LinearRegression should cache standardization param value
## What changes were proposed in this pull request?
Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` rather than re-fetching it from the ```ParamMap``` for every OWLQN iteration.
cc srowen

## How was this patch tested?
No extra tests are added. It should pass all existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11367 from yanboliang/spark-13490.
2016-02-25 13:34:29 +00:00
Oliver Pierson 6f8e835c68 [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames
## What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to

`val requiredSamples = math.max(numBins * numBins, 10000.0)`

so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
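
The effect can be reproduced outside Spark. The sketch below is an assumption-laden reduction of the two lines in question: `fractionBefore` and `fractionAfter` are hypothetical names, and `count` stands in for `dataset.count`.

```scala
object SampleFractionSketch {
  // Before: requiredSamples is an Int, so Int / Long is integer division
  // and truncates to 0 whenever requiredSamples < count.
  def fractionBefore(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min((requiredSamples / count).toDouble, 1.0)
  }

  // After: the literal 10000.0 makes requiredSamples a Double,
  // so the division keeps its fractional part.
  def fractionAfter(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / count, 1.0)
  }
}
```

With 10 bins and one million rows, the first version produces a sampling fraction of zero, while the second produces roughly 0.01.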

## How was this patch tested?
Manual tests.  I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>

Closes #11319 from oliverpierson/SPARK-13444.
2016-02-25 13:24:46 +00:00
Xusen Yin 8d29001dec [SPARK-13011] K-means wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-13011

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11124 from yinxusen/SPARK-13011.
2016-02-23 15:42:58 -08:00
Grzegorz Chilkiewicz 5d69eaf097 [SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>

Closes #11216 from grzegorz-chilkiewicz/master.
2016-02-23 10:30:02 -08:00
Xiangrui Meng 764ca18037 [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply
`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave

Author: Xiangrui Meng <meng@databricks.com>

Closes #11226 from mengxr/SPARK-13355.
2016-02-22 23:54:21 -08:00
Yanbo Liang 72427c3e11 [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib
## What changes were proposed in this pull request?
In order to provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6```, which will be equal to that of ML ```LogisticRegression```.
cc dbtsai
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11299 from yanboliang/spark-13429.
2016-02-22 23:37:09 -08:00
Narine Kokhlikyan 33ef3aa7ea [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record
As also noted by the TODO in the AFTAggregator.AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated
with another array that contains the betas; the resulting array is converted into a dense vector, which in turn is converted into a Breeze vector.
This is expensive and not necessarily beautiful.

I've tried to solve above mentioned problem by simple algebraic decompositions - keeping and treating intercept independently.
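
The algebraic idea, reduced to a sketch: the dot product of `[intercept, betas]` with `[1.0, features]` equals `intercept + betas . features`, so the concatenated arrays never need to exist. `MarginSketch` and its methods are illustrative names only; the real aggregator also carries the scale parameter and gradient terms, which are left out here.

```scala
object MarginSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Allocating version: builds two new arrays for every record.
  def marginConcat(intercept: Double, beta: Array[Double], x: Array[Double]): Double =
    dot(intercept +: beta, 1.0 +: x)

  // Allocation-free version: the intercept is kept and added separately.
  def marginSeparate(intercept: Double, beta: Array[Double], x: Array[Double]): Double =
    dot(beta, x) + intercept
}
```

Both functions return the same margin; only the second avoids per-record array allocation.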

Please let me know what you think and if you have any questions.

Thanks,
Narine

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #11179 from NarineK/survivaloptim.
2016-02-22 17:26:32 -08:00
Yanbo Liang 40e6d40fe7 [SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent
ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11214 from yanboliang/spark-13334.
2016-02-22 12:59:50 +02:00
Bryan Cutler e298ac91e3 [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the fpm and recommendation modules.

Closes #10602
Closes #10897

Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
2016-02-22 12:48:37 +02:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Yong Gang Cao ef1047fca7 [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
add support of arbitrary length sentence by using the natural representation of sentences in the input.

add new similarity functions and add normalization option for distances in synonym finding
add new accessor for internal structure(the vocabulary and wordindex) for convenience

need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>

Closes #10152 from ygcao/improvementForSentenceBoundary.
2016-02-22 09:47:36 +00:00
Yanbo Liang 8a4ed78869 [SPARK-13379][MLLIB] Fix MLlib LogisticRegressionWithLBFGS set regularization incorrectly
## What changes were proposed in this pull request?
Fix MLlib LogisticRegressionWithLBFGS regularization map as:
```SquaredL2Updater``` -> ```elasticNetParam = 0.0```
```L1Updater``` -> ```elasticNetParam = 1.0```
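
For context, `elasticNetParam` mixes the two penalties in the ML implementation, so 0.0 is a pure L2 penalty and 1.0 is a pure L1 penalty. A minimal sketch of that penalty, written as a standalone function (`ElasticNetSketch.penalty` is a hypothetical name, not an MLlib method):

```scala
object ElasticNetSketch {
  // regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // where alpha is elasticNetParam.
  def penalty(w: Array[Double], regParam: Double, elasticNetParam: Double): Double = {
    val l1 = w.map(v => math.abs(v)).sum
    val l2Squared = w.map(v => v * v).sum
    regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2Squared)
  }
}
```

Swapping the two values would therefore apply an L1 penalty where L2 was requested, and the other way round.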
cc dbtsai
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11258 from yanboliang/spark-13379.
2016-02-21 20:20:41 -08:00
Xiangrui Meng 0088b252bf [MINOR][MLLIB] fix mllib compile warnings
This PR fixes some warnings found by `build/sbt mllib/test:compile`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #11227 from mengxr/fix-mllib-warnings-201602.
2016-02-17 18:56:19 -08:00
BenFradet 00c72d27bf [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml` with example code in scala, java and python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10411 from BenFradet/SPARK-12247.
2016-02-16 13:03:28 +00:00
seddonm1 cbeb006f23 [SPARK-13097][ML] Binarizer allowing Double AND Vector input types
This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.

A use case for this enhancement is for when a user wants to Binarize many similar feature columns at once using the same threshold value (for example a binary threshold applied to many pixels in an image).
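
A sketch of the intended behavior on plain values, assuming the usual rule that values strictly greater than the threshold map to 1.0 and everything else to 0.0. `BinarizerSketch` is an illustrative name and works on arrays rather than on DataFrame columns.

```scala
object BinarizerSketch {
  // Scalar case: the existing Double input type.
  def binarize(value: Double, threshold: Double): Double =
    if (value > threshold) 1.0 else 0.0

  // Vector case: the same threshold applied to every element.
  def binarize(values: Array[Double], threshold: Double): Array[Double] =
    values.map(v => binarize(v, threshold))
}
```

Applied to pixel intensities, one call thresholds the whole vector instead of requiring one Binarizer per column.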

This contribution is my original work and I license the work to the project under the project's open source license.

viirya mengxr

Author: seddonm1 <seddonm1@gmail.com>

Closes #10976 from seddonm1/master.
2016-02-15 20:15:27 -08:00
Liang-Chi Hsieh e3441e3f68 [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are 0.0 instead of the correct values. Setting `TripletFields.All` in `mapTriplets` makes it work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.
2016-02-13 15:56:20 -08:00
Earthson Lu 5f1c359069 [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false)
https://issues.apache.org/jira/browse/SPARK-12746

Author: Earthson Lu <Earthson.Lu@gmail.com>

Closes #10697 from Earthson/SPARK-12746.
2016-02-11 18:31:46 -08:00
Liu Xiang a5257048d7 [SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lost transformSchema
https://issues.apache.org/jira/browse/SPARK-12765

Author: Liu Xiang <lxmtlab@gmail.com>

Closes #10720 from sloth2012/sloth.
2016-02-11 17:28:37 -08:00
Yu ISHIKAWA 574571c870 [SPARK-11515][ML] QuantileDiscretizer should take random seed
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9535 from yu-iskw/SPARK-11515.
2016-02-11 15:05:34 -08:00
Yu ISHIKAWA efb65e09bc [SPARK-13265][ML] Refactoring of basic ML import/export for other file system besides HDFS
jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11151 from yu-iskw/SPARK-13265.
2016-02-11 15:00:23 -08:00
Sasaki Toru c2f21d8898 [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template
In spark-env.sh.template, there are multi-byte characters; this PR removes them.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
2016-02-11 09:30:36 +00:00
Liang-Chi Hsieh 9267bc68fa [SPARK-10524][ML] Use the soft prediction to order categories' bins
JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
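
One way to see the difference, for binary labels only: a hard prediction collapses every category to 0 or 1, so many categories tie, while a soft prediction (the fraction of positive labels) gives a finer ordering. The sketch below is not the tree code touched by this PR; `CategoryOrderingSketch`, `Counts`, `hardKey` and `softKey` are hypothetical names.

```scala
object CategoryOrderingSketch {
  // Negative and positive label counts observed for one categorical value.
  final case class Counts(neg: Double, pos: Double)

  // Hard prediction: the majority label.
  def hardKey(c: Counts): Double = if (c.pos > c.neg) 1.0 else 0.0

  // Soft prediction: the fraction of positive labels.
  def softKey(c: Counts): Double =
    if (c.pos + c.neg == 0.0) 0.0 else c.pos / (c.pos + c.neg)

  // Order categories by key, breaking ties by name to keep the result stable.
  def order(stats: Map[String, Counts], key: Counts => Double): Seq[String] =
    stats.toSeq.sortWith { case ((n1, c1), (n2, c2)) =>
      val k1 = key(c1)
      val k2 = key(c2)
      if (k1 != k2) k1 < k2 else n1 < n2
    }.map(_._1)
}
```

With categories `a` (1 negative, 9 positive) and `b` (4 negative, 6 positive), the hard key cannot tell them apart, whereas the soft key places `b` before `a`.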

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8734 from viirya/dt-soft-centroids.
2016-02-09 17:10:55 -08:00
Holden Karau ce83fe9756 [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator
KMeans:
Make a private non-deprecated version of setRuns API so that we can call it from the PythonAPI without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values

MFDataGenerator:
Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.

I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.

Author: Holden Karau <holden@us.ibm.com>

Closes #11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
2016-02-09 08:47:28 +00:00
Gary King bc8890b357 [SPARK-13132][MLLIB] cache standardization param value in LogisticRegression
cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-newton optimizer

also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
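
The pattern behind the first change, as a standalone sketch. `CountingParams` is a hypothetical stand-in for a parameter map that counts its reads, and the per-feature weight formula is only illustrative; neither is taken from the patch.

```scala
object ParamCachingSketch {
  // Hypothetical stand-in for a ParamMap; it counts how often it is read.
  final class CountingParams(values: Map[String, Boolean]) {
    var lookups: Int = 0
    def get(name: String): Boolean = { lookups += 1; values(name) }
  }

  // Illustrative per-feature regularization weight (an assumption of this sketch).
  private def weight(standardization: Boolean, regParam: Double, std: Double): Double =
    if (standardization) regParam else if (std != 0.0) regParam / std else 0.0

  // Before: one map lookup per feature index on every call.
  def refetching(params: CountingParams, regParam: Double, featuresStd: Array[Double]): Array[Double] =
    featuresStd.map(std => weight(params.get("standardization"), regParam, std))

  // After: a single lookup, hoisted out of the loop.
  def cached(params: CountingParams, regParam: Double, featuresStd: Array[Double]): Array[Double] = {
    val standardization = params.get("standardization")
    featuresStd.map(std => weight(standardization, regParam, std))
  }
}
```

Both versions compute the same weights; the cached one performs one lookup per call instead of one per feature, which matters when the function runs on every optimizer step.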

this change improves training times for one of my test sets from ~7m30s to ~4m30s

Author: Gary King <gary@idibon.com>

Closes #11027 from idigary/spark-13132-optimize-logistic-regression.
2016-02-07 09:13:28 +00:00
Imran Younus 0557146619 [SPARK-12732][ML] bug fix in linear regression train
Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently.

Author: Imran Younus <iyounus@us.ibm.com>

Closes #10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
2016-02-02 20:38:53 -08:00
Grzegorz Chilkiewicz b1835d7272 [SPARK-12711][ML] ML StopWordsRemover does not protect itself from column name duplication
Fixes problem and verifies fix by test suite.
Also - adds optional parameter: nullable (Boolean) to: SchemaUtils.appendColumn
and deduplicates SchemaUtils.appendColumn functions.

Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>

Closes #10741 from grzegorz-chilkiewicz/master.
2016-02-02 11:16:24 -08:00
Bryan Cutler cba1d6b659 [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the clustering module.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
2016-02-02 10:50:22 -08:00
Josh Rosen 289373b28c [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).

The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).

After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10608 from JoshRosen/SPARK-6363.
2016-01-30 00:20:28 -08:00
Yanbo Liang df78a934a0 [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver
Implement ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```.
There are two limitations in the current implementation compared with R:
* It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code:
```
glm( cbind(using, notUsing) ~  age + education + wantsMore , family = binomial)
```
* It does not support ```offset```.

Because ```RFormula``` does not support ```Tuple``` as a label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard, and I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS.
The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
Please focus on the main structure and overlook minor issues/docs, which I will update later. Any comments and opinions will be appreciated.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10639 from yanboliang/spark-9835.
2016-01-28 14:29:47 -08:00
Holden Karau b72611f20a [SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized
The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API.
Note that both of them do feature scaling to improve the convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.

Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
2016-01-26 17:59:05 -08:00
Jeff Zhang 1dac964c1b [SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and…
… Add LibSVMOutputWriter

The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter
* Partition is still not supported
* Multiple input paths are not supported

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9595 from zjffdu/SPARK-11622.
2016-01-26 17:31:19 -08:00
Xusen Yin fbf7623d49 [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer other than its parent class
https://issues.apache.org/jira/browse/SPARK-12952

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10863 from yinxusen/SPARK-12952.
2016-01-26 13:18:01 -08:00
Xusen Yin ae47ba718a [SPARK-12834] Change ser/de of JavaArray and JavaList
https://issues.apache.org/jira/browse/SPARK-12834

We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I described in https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10772 from yinxusen/SPARK-12834.
2016-01-25 22:41:52 -08:00
Yanbo Liang dcae355c64 [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark
```PCAModel```  can output ```explainedVariance``` at Python side.

cc mengxr srowen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10830 from yanboliang/spark-12905.
2016-01-25 13:54:21 -08:00
Yanbo Liang dd2325d9a7 [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions
Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10222 from yanboliang/spark-11965.
2016-01-25 11:52:26 -08:00
Shixiong Zhu bc1babd63d [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
- Update comments and docs

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10854 from zsxwing/remove-akka.
2016-01-22 21:20:04 -08:00
DB Tsai b4574e387d [SPARK-12908][ML] Add warning message for LogisticRegression for potential converge issue
When all labels are the same, it's a dangerous ground for LogisticRegression without intercept to converge. GLMNET doesn't support this case, and will just exit. GLM can train, but will have a warning message saying the algorithm doesn't converge.

Author: DB Tsai <dbt@netflix.com>

Closes #10862 from dbtsai/add-tests.
2016-01-21 17:24:48 -08:00
Takahashi Hiroshi e3727c409f [SPARK-10263][ML] Add @Since annotation to ml.param and ml.*
Add Since annotations to ml.param and ml.*

Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>

Closes #8935 from taishi-oss/issue10263.
2016-01-20 11:44:04 -08:00
Imran Younus 9753835cf3 [SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero.
This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train.
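
The shortcut described above can be sketched in plain Python. This is only an illustration, not the Scala code from the PR: the function name `constant_label_shortcut` and its return convention are hypothetical, and it assumes that a constant label with an intercept yields all-zero coefficients and an intercept equal to the label mean.

```python
from statistics import pstdev


def constant_label_shortcut(labels, num_features, fit_intercept):
    """Return (coefficients, intercept) when training can be skipped, else None.

    Hypothetical helper: if the label has zero standard deviation and an
    intercept is fitted, the intercept alone explains the data, so there is
    nothing to divide by the (zero) label standard deviation.
    """
    if fit_intercept and pstdev(labels) == 0.0:
        mean = sum(labels) / len(labels)
        return [0.0] * num_features, mean
    return None  # fall through to the normal solver (not shown)


print(constant_label_shortcut([3.0, 3.0, 3.0], 2, True))
print(constant_label_shortcut([1.0, 2.0, 3.0], 2, True))
```

Returning early avoids the division by the label's standard deviation that standardization would otherwise perform.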

Author: Imran Younus <iyounus@us.ibm.com>

Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
2016-01-20 11:16:59 -08:00
Yu ISHIKAWA 9376ae723e [SPARK-6519][ML] Add spark.ml API for bisecting k-means
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9604 from yu-iskw/SPARK-6519.
2016-01-20 10:48:10 -08:00
BenFradet f6f7ca9d2e [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10472 from BenFradet/SPARK-9716.
2016-01-19 14:59:20 -08:00
Feynman Liang 2388de5191 [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data
CC jkbradley mengxr dbtsai

Author: Feynman Liang <feynman.liang@gmail.com>

Closes #10743 from feynmanliang/SPARK-12804.
2016-01-19 11:08:52 -08:00
Holden Karau 0ddba6d88f [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means
From the coverage issues for 1.6: Add Python API for mllib.clustering.BisectingKMeans.

Author: Holden Karau <holden@us.ibm.com>

Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
2016-01-19 10:15:54 -08:00
Wojciech Jurczyk ebd9ce0f1f [MLLIB] Fix CholeskyDecomposition assertion's message
Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, whereas in fact it was the lapack.dppsv method.

Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>

Closes #10818 from wjur/wjur/rename_error_message.
2016-01-19 09:36:45 +00:00
Eric Liang 5e492e9d5b [SPARK-12346][ML] Missing attribute names in GLM for vector-type features
Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #10323 from ericl/spark-12346.
2016-01-18 12:50:58 -08:00
Tommy YU 233d6cee96 [SPARK-10264][DOCUMENTATION] Added @Since to ml.recomendation
I created a new PR since the original PR has not been updated for a long time.
Please help review.

srowen

Author: Tommy YU <tummyyu@163.com>

Closes #10756 from Wenpei/add_since_to_recomm.
2016-01-18 13:46:14 +00:00
Reynold Xin fe7246fea6 [SPARK-12830] Java style: disallow trailing whitespaces.
Author: Reynold Xin <rxin@databricks.com>

Closes #10764 from rxin/SPARK-12830.
2016-01-14 23:33:45 -08:00
Yuhao Yang 021dafc6a0 [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large
jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.

I tested locally: the change improves performance, and the running time stays stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.
2016-01-13 17:43:27 -08:00
Sean Owen c48f2a3a5f [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero
Cosine similarity with 0 vector should be 0
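
A minimal sketch of that rule in plain Python, assuming dense vectors given as lists; `cosine_similarity` is an illustrative helper, not the Word2Vec code touched by this commit:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 instead of dividing by a zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Guard: a zero vector has no direction, so define the similarity as 0.
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Without the guard, the first call would raise `ZeroDivisionError` in Python; on the JVM the same floating-point division would produce NaN.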

Related to https://github.com/apache/spark/pull/10152

Author: Sean Owen <sowen@cloudera.com>

Closes #10696 from srowen/SPARK-7615.
2016-01-12 11:50:33 +00:00
Yuhao Yang bbea88852c [SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel
jira: https://issues.apache.org/jira/browse/SPARK-10809

We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.

Also adds some missing asserts.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9484 from hhbyyh/ldaTopicPre.
2016-01-11 14:55:44 -08:00
Yuhao Yang 4f8eefa36b [SPARK-12685][MLLIB] word2vec trainWordsCount gets overflow
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset.

Update the priority as it will affect the computation process.
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
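
To see why the overflow matters, here is a small Python simulation of 32-bit wrap-around applied to the formula above. The true word count, learning rate, partition count and per-partition word count are made-up values chosen for illustration; the true count is simply one value that wraps to the logged number.

```python
def to_int32(n):
    """Wrap an unbounded Python int the way a 32-bit signed JVM Int would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def alpha(learning_rate, num_partitions, word_count, train_words_count):
    """The learning-rate formula quoted in the commit message."""
    return learning_rate * (
        1 - num_partitions * float(word_count) / (train_words_count + 1)
    )


true_count = 3_509_239_813          # assumed count, larger than Int.MaxValue
wrapped = to_int32(true_count)      # what a 32-bit counter would hold

print(wrapped)
print(alpha(0.025, 1, 1_000_000, true_count))  # decays below 0.025
print(alpha(0.025, 1, 1_000_000, wrapped))     # grows above 0.025
```

With the wrapped (negative) count the ratio changes sign, so `alpha` increases over time instead of decaying.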

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10627 from hhbyyh/w2voverflow.
2016-01-11 14:48:35 -08:00
Yanbo Liang ee4ee02b86 [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala does.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10552 from yanboliang/spark-12603.
2016-01-11 14:43:25 -08:00
Marcelo Vanzin 6439a82503 [SPARK-3873][BUILD] Enable import ordering error checking.
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.

Plus a few fixes to imports that crept in since my recent cleanups.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10612 from vanzin/SPARK-3873-enable.
2016-01-10 20:04:50 -08:00
Kousuke Saruta e5904bb5e7 [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10684 from sarutak/SPARK-12692-followup-mllib.
2016-01-10 12:38:57 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Robert Dodier 6b6d02be0d [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile
This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).

For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)

Author: Robert Dodier <robert_dodier@users.sourceforge.net>

Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
2016-01-06 19:49:10 -08:00
BenFradet f82ebb1522 [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator' metricName
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.

This PR aims to fix both issues.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10328 from BenFradet/SPARK-12368.
2016-01-06 12:01:05 -08:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
RJ Nowling 78015a8b7c [SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans
SPARK-12450. Un-persist broadcasted variables in KMeans.

Author: RJ Nowling <rnowling@gmail.com>

Closes #10415 from rnowling/spark-12450.
2016-01-05 15:05:04 -08:00
Yanbo Liang 13a3b636d9 [SPARK-6724][MLLIB] Support model save/load for FPGrowthModel
Support model save/load for FPGrowthModel

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9267 from yanboliang/spark-6724.
2016-01-05 13:31:59 -08:00
Imran Younus 1cdc42d2b9 [SPARK-12331][ML] R^2 for regression through the origin.
Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.
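
As a rough illustration of the two definitions (the helper and its `through_origin` flag are hypothetical, and the through-origin form below is the commonly used uncentered one, assumed rather than taken from the PR): with an intercept, the total sum of squares is centered on the label mean, while for regression through the origin it is the raw sum of squared labels.

```python
def r_squared(labels, predictions, through_origin=False):
    """R^2 = 1 - SS_err / SS_tot, with SS_tot depending on the model form."""
    ss_err = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    if through_origin:
        # No intercept: compare against the all-zero model.
        ss_tot = sum(y ** 2 for y in labels)
    else:
        # With intercept: compare against the constant mean model.
        mean = sum(labels) / len(labels)
        ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_err / ss_tot


y = [1.0, 2.0, 3.0]
pred = [1.1, 1.9, 3.2]
print(r_squared(y, pred))
print(r_squared(y, pred, through_origin=True))
```

For the same predictions the two values differ, which is why a model without an intercept needs its own baseline.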

Author: Imran Younus <iyounus@us.ibm.com>
Author: Imran Younus <imranyounus@gmail.com>

Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
2016-01-05 11:48:45 +00:00
Yanbo Liang 93ef9b6a2a [SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction
DecisionTreeRegressor will provide variance of prediction as a Double column.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8866 from yanboliang/spark-9622.
2016-01-04 13:32:14 -08:00
Yanbo Liang ba5f81859d [SPARK-11259][ML] Params.validateParams() should be called automatically
See JIRA: https://issues.apache.org/jira/browse/SPARK-11259

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9224 from yanboliang/spark-11259.
2016-01-04 13:30:17 -08:00
Reynold Xin 513e3b092c [SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces the invocation of the deprecated callUDF with it.

Author: Reynold Xin <rxin@databricks.com>

Closes #10547 from rxin/SPARK-12599.
2016-01-02 22:31:39 -08:00
Marcelo Vanzin a59a357cae [SPARK-3873][MLLIB] Import order fixes.
A slight adjustment to the checker configuration was needed; there is
a handful of warnings still left, but those are because of a bug in
the checker that I'll fix separately (before enabling errors for the
checker, of course).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10535 from vanzin/SPARK-3873-mllib.
2015-12-31 23:48:55 -08:00
Sean Owen be86268eb5 [SPARK-12349][SPARK-12349][ML] Fix typo in Spark version regex introduced in / PR 10327
Sorry jkbradley
Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942

Author: Sean Owen <sowen@cloudera.com>

Closes #10508 from srowen/SPARK-12349.2.
2015-12-29 16:32:26 -08:00
Shixiong Zhu 710b411729 [SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs
Include the following changes:

1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10440 from zsxwing/findbugs.
2015-12-28 15:01:51 -08:00
Kousuke Saruta 07165ca06f [SPARK-12424][ML] The implementation of ParamMap#filter is wrong.
ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`.
Also, the return type of Map#filterKeys is not Serializable. This is a Scala issue (https://issues.scala-lang.org/browse/SI-6654).

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10381 from sarutak/SPARK-12424.
2015-12-29 05:33:19 +09:00
Kazuaki Ishizaki 3920466118 [SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property
Restore the original value of os.arch property after each test

Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10289 from kiszk/SPARK-12311.
2015-12-24 13:37:28 +00:00
Sean Owen d0f695089e [SPARK-12349][ML] Make spark.ml PCAModel load backwards compatible
Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x
jkbradley is this kind of what you had in mind?

Author: Sean Owen <sowen@cloudera.com>

Closes #10327 from srowen/SPARK-12349.
2015-12-21 10:21:22 +00:00
Bryan Cutler ce1798b3af [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs
Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized.  It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer."  Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
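
The kind of check described can be sketched in plain Python. This is not the actual change (which lives in the Scala serialization code); `check_rating_id` is a hypothetical helper that mirrors the error text quoted above and raises a `ValueError` rather than a `PickleException`.

```python
INT_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed integer


def check_rating_id(rating_id):
    """Return the id unchanged if it fits in a 32-bit Int, else raise early."""
    if rating_id > INT_MAX:
        raise ValueError(
            f"Ratings id {rating_id} exceeds max integer value of {INT_MAX}."
        )
    return rating_id


print(check_rating_id(42))
try:
    check_rating_id(1205640308657491975)
except ValueError as err:
    print(err)
```

Validating the id before the cast lets the error message name the offending value instead of only the mismatched types.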

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
2015-12-20 09:08:23 +00:00
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
Yanbo Liang d252b2d544 [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites
Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need updating.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10279 from yanboliang/spark-12309.
2015-12-16 11:07:54 -08:00
Yanbo Liang 860dc7f2f8 [SPARK-9694][ML] Add random seed Param to Scala CrossValidator
Add random seed Param to Scala CrossValidator

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9108 from yanboliang/spark-9694.
2015-12-16 11:05:37 -08:00
Liang-Chi Hsieh b51a4cdff3 [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark
JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #10100 from viirya/fix-load-py-wordvecmodel.
2015-12-14 09:59:42 -08:00
Mike Dusenberry 1b8220387e [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
2015-12-11 14:21:33 -08:00
Holden Karau 518ab51010 [SPARK-10991][ML] logistic regression training summary handle empty prediction col
LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718 )

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
2015-12-11 02:35:53 -05:00
Yuhao Yang 9fba9c8004 [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit
jira: https://issues.apache.org/jira/browse/SPARK-11602

Made a pass over the API changes in 1.6. Opening the PR for efficient discussion.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9939 from hhbyyh/auditScala.
2015-12-10 10:15:50 -08:00
Sean Owen 21b3d2a75f [SPARK-11530][MLLIB] Return eigenvalues with PCA model
Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance.

CC mengxr

Author: Sean Owen <sowen@cloudera.com>

Closes #9736 from srowen/SPARK-11530.
2015-12-10 14:05:45 +00:00
Holden Karau 22b9a8740d [SPARK-10299][ML] word2vec should allow users to specify the window size
Currently word2vec has the window hard-coded at 5, but some users may want different sizes (for example, when working with n-gram input or similar). The user request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.
2015-12-09 16:45:13 +00:00
Dominik Dahlem a0046e379b [SPARK-11343][ML] Documentation of float and double prediction/label columns in RegressionEvaluator
felixcheung , mengxr

Just added a message to require()

Author: Dominik Dahlem <dominik.dahlem@gmail.combination>

Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.
2015-12-08 18:54:10 -08:00
Yuhao Yang 5cb4695051 [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.

fix:

1. `StreamingTest.registerStream` needs a Java-friendly interface.

2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `developerAPI`.

TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation.
`SVMModel` has a similar issue.

Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with a different name does not seem necessary.

cc jkbradley feynmanliang

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10102 from hhbyyh/javaAPI.
2015-12-08 11:46:26 -08:00
Nakul Jindal 037b7e76a7 [SPARK-11439][ML] Optimization of creating sparse feature without dense one
Sparse feature generated in LinearDataGenerator does not create dense vectors as an intermediate any more.

Author: Nakul Jindal <njindal@us.ibm.com>

Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
2015-12-08 11:08:27 +00:00
Yanbo Liang 4a39b5a1be [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code
Add ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.
2015-12-07 23:50:57 -08:00
Takahashi Hiroshi 7d05a62451 [SPARK-10259][ML] Add @since annotation to ml.classification
Add since annotation to ml.classification

Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>

Closes #8534 from taishi-oss/issue10259.
2015-12-07 23:46:55 -08:00
Joseph K. Bradley 3e7e05f5ee [SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib
Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib.  There were no uses of the constructor in spark.ml.

CC: mengxr yhuai

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10161 from jkbradley/mllib-sqlcontext-fix.
2015-12-07 16:37:09 -08:00
Sean Owen 7da6748519 [SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7
Update JPMML pmml-model to 1.2.7

Author: Sean Owen <sowen@cloudera.com>

Closes #9972 from srowen/SPARK-11988.
2015-12-05 15:52:52 +00:00
Antonio Murgia e9c9ae22b9 [SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max
Author: Antonio Murgia <antonio.murgia2@studio.unibo.it>

Closes #9989 from tmnd1991/SPARK-11932.
2015-12-05 15:42:02 +00:00
Yuhao Yang ee94b70ce5 [SPARK-12096][MLLIB] remove the old constraint in word2vec
jira: https://issues.apache.org/jira/browse/SPARK-12096

word2vec can now handle a much bigger vocabulary.
The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be removed.

The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue)
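
A quick arithmetic check of the two bounds in Python. The helper names are hypothetical, the old bound is taken to be Int.MaxValue / 8, and the default `max_array_length` of Int.MaxValue - 8 is only an assumed stand-in for "a little less than Int.MaxValue".

```python
INT_MAX = 2**31 - 1  # JVM Int.MaxValue


def fits_old_constraint(vocab_size, vector_size):
    """Old bound: vocabSize * vectorSize < Int.MaxValue / 8."""
    return vocab_size * vector_size < INT_MAX // 8


def fits_new_constraint(vocab_size, vector_size, max_array_length=INT_MAX - 8):
    """New bound: the flattened vector table must fit in a single JVM array."""
    return vocab_size * vector_size < max_array_length


# Roughly the configuration mentioned in this commit: 18M words, vectorSize = 100.
print(fits_old_constraint(18_000_000, 100))
print(fits_new_constraint(18_000_000, 100))
```

Python integers do not overflow, so the product is computed exactly here; in Scala the `.toLong` in the constraint serves the same purpose.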

I tested with vocabsize over 18M and vectorsize = 100.

srowen jkbradley Sorry for missing this in the last PR. I was reminded today.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10103 from hhbyyh/w2vCapacity.
2015-12-05 15:27:31 +00:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Dmitry Erastov d0d8222778 [SPARK-6990][BUILD] Add Java linting script; fix minor warnings
This replaces https://github.com/apache/spark/pull/9696

Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.

Suggest fixing those TODOs in a separate PR(s).

More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I ran the build twice with different profiles):

> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

Also fix some of the minor violations that didn't require sweeping changes.

Apologies for the previous botched PRs - I finally figured out the issue.

cr: JoshRosen, pwendell

> I state that the contribution is my original work, and I license the work to the project under the project's open source license.

Author: Dmitry Erastov <derastov@gmail.com>

Closes #9867 from dskrvk/master.
2015-12-04 12:03:45 -08:00
Xiangrui Meng 9bb695b7a8 [SPARK-12000] do not specify arg types when reference a method in ScalaDoc
This fixes SPARK-12000, verified locally with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations.

cc: JoshRosen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #10114 from mengxr/SPARK-12000.2.
2015-12-02 17:19:31 -08:00
Yu ISHIKAWA de07d06abe [SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tunning
cc mengxr noel-smith

I worked on this issue based on https://github.com/apache/spark/pull/8729.
ehsanmok  thank you for your contribution!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>

Closes #9338 from yu-iskw/JIRA-10266.
2015-12-02 14:15:54 -08:00
Cheng Lian 69dbe6b40d [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues
This PR backports PR #10039 to master

Author: Cheng Lian <lian@databricks.com>

Closes #10063 from liancheng/spark-12046.doc-fix.master.
2015-12-01 10:21:31 -08:00
Yuhao Yang a0af0e351e [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization.

Using broadcast can greatly improve the performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can:

1. decrease the worker memory consumption by 45%.
2. decrease running time by 40%.

This will also help extend the upper limit for Word2Vec.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9878 from hhbyyh/w2vBC.
2015-12-01 09:26:58 +00:00
Yuhao Yang 52bc25c8e2 [SPARK-11847][ML] Model export/import for spark.ml: LDA
Add read/write support to LDA, similar to ALS.

save/load for ml.LocalLDAModel is done.
For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9894 from hhbyyh/ldaMLsave.
2015-11-24 09:56:17 -08:00
Joseph K. Bradley 9e24ba667e [SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col
Doc for 1.6 that the summaries mostly ignore the weight column.
To be corrected for 1.7

CC: mengxr thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9927 from jkbradley/linregsummary-doc.
2015-11-24 09:54:55 -08:00