Commit graph

1328 commits

Author SHA1 Message Date
Shally Sangal d356901588 [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes
## What changes were proposed in this pull request?

KMeansSummary class : deprecated size and added clusterSizes

Author: Shally Sangal <shallysangal@gmail.com>

Closes #12084 from shallys/master.
2016-04-05 10:41:59 -07:00
Joseph K. Bradley 8f50574ab4 [SPARK-14386][ML] Changed spark.ml ensemble trees methods to return concrete types
## What changes were proposed in this pull request?

In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types.

This PR: return concrete types

The MIMA checks appear to be OK with this change.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12158 from jkbradley/hide-dtm.
2016-04-04 20:12:09 -07:00
Joseph K. Bradley 89f3befab6 [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor
## What changes were proposed in this pull request?

**Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)

Modified numTrees method (*deprecation*)
* Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
* What this PR does: Moves method numTrees outside of trait TreeEnsembleModel.  Adds it to GBT and RF Models.  Deprecates it in RF Models in favor of new method getNumTrees.  In Spark 2.1, we can have RF Models include Param numTrees.

Minor items
* Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.

**Implementation details**
* Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
* Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
  * These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
* Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
* Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame.  Same for DefaultParamsReader.loadMetadata

## How was this patch tested?

Adds standard unit tests for RF save/load

Author: Joseph K. Bradley <joseph@databricks.com>
Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #12118 from jkbradley/GayathriMurali-SPARK-13784.
2016-04-04 10:24:02 -07:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Jacek Laskowski 06694f1c68 [MINOR] Typo fixes
## What changes were proposed in this pull request?

Typo fixes. No functional changes.

## How was this patch tested?

Built the sources and ran with samples.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #11802 from jaceklaskowski/typo-fixes.
2016-04-02 08:12:04 -07:00
sethah 4fc35e6f5c [SPARK-14308][ML][MLLIB] Remove unused mllib tree classes and move private classes to ML
## What changes were proposed in this pull request?

Decision tree helper classes will be migrated to ML. This patch moves those internal classes that are not part of the public API and removes ones that are no longer used, after [SPARK-12183](https://github.com/apache/spark/pull/11855). No functional changes are made.

Details:
* Bin.scala is removed as the ML implementation does not require bins
* mllib NodeIdCache is removed. It was only used by the mllib implementation previously, which no longer exists
* mllib TreePoint is removed. It was only used by the mllib implementation previously, which no longer exists
* BaggedPoint, DTStatsAggregator, DecisionTreeMetadata, BaggedPointSuite and TimeTracker are all moved to ML.

## How was this patch tested?

No functional changes are made. Existing unit tests ensure behavior is unchanged.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12097 from sethah/cleanup_mllib_tree.
2016-04-01 21:23:35 -07:00
BenFradet 36e8fb8005 [SPARK-7425][ML] spark.ml Predictor should support other numeric types for label
Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10355 from BenFradet/SPARK-7425.
2016-04-01 18:25:43 -07:00
Cheng Lian 3715ecdf41 [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
2016-04-01 17:02:48 +08:00
Yanbo Liang 22249afb4a [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It's only the code refactor for the original ```KMeans``` wrapper.

## How was this patch tested?
Existing tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12039 from yanboliang/spark-14059.
2016-03-31 23:49:58 -07:00
Alexander Ulanov 26867ebc67 [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased flexibility of
internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>

Closes #9229 from avulanov/mlp-refactoring.
2016-03-31 23:48:36 -07:00
Cheng Lian 1b070637fa [SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM
## What changes were proposed in this pull request?

This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:

```scala
  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options
```

After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.

An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.

One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.

## How was this patch tested?

Tested using existing test suites.

Author: Cheng Lian <lian@databricks.com>

Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
2016-03-31 23:46:08 -07:00
Xusen Yin 8b207f3b6a [SPARK-11892][ML] Model export/import for spark.ml: OneVsRest
# What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11892

Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.

# How was this patch tested?

Test with Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9934 from yinxusen/SPARK-11892.
2016-03-31 11:17:32 -07:00
Yuhao Yang a0a1991580 [SPARK-13782][ML] Model export/import for spark.ml: BisectingKMeans
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13782
Model export/import for BisectingKMeans in spark.ml and mllib

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11933 from hhbyyh/bisectingsave.
2016-03-31 11:12:40 -07:00
Dongjoon Hyun 208fff3ac8 [SPARK-14164][MLLIB] Improve input layer validation of MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier.

```scala
-    // TODO: how to check ALSO that all elements are greater than 0?
-    ParamValidators.arrayLengthGt(1)
+    (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
```

## How was this patch tested?

Pass the Jenkins tests including the new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11964 from dongjoon-hyun/SPARK-14164.
2016-03-31 09:39:15 -07:00
Yuhao Yang ca458618d8 [SPARK-11507][MLLIB] add compact in Matrices fromBreeze
jira: https://issues.apache.org/jira/browse/SPARK-11507
"In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."

root cause: colPtr.last does NOT always equal to values.length in breeze SCSMatrix, which fails the require in SparseMatrix.

easy step to repro:
```
val m1: BM[Double] = new CSCMatrix[Double] (Array (1.0, 1, 1), 3, 3, Array (0, 1, 2, 3), Array (0, 1, 2) )
val m2: BM[Double] = new CSCMatrix[Double] (Array (1.0, 2, 2, 4), 3, 3, Array (0, 0, 2, 4), Array (1, 2, 1, 2) )
val sum = m1 + m2
Matrices.fromBreeze(sum)
```

Solution: By checking the code in [CSCMatrix](28000a7b90/math/src/main/scala/breeze/linalg/CSCMatrix.scala), CSCMatrix in breeze can have extra zeros in the end of data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead as the actual compact operation is only performed when necessary.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9520 from hhbyyh/matricesFromBreeze.
2016-03-30 15:58:19 -07:00
Yanbo Liang 5dc948e812 [MINOR][ML] Fix the wrong param name of LDA topicDistributionCol
## What changes were proposed in this pull request?
Fix the wrong param name of LDA ```topicDistributionCol```.
## How was this patch tested?
No tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12065 from yanboliang/lda-topicDistributionCol.
2016-03-30 14:57:38 -07:00
Xusen Yin 529d6ce8f9 [SPARK-14181] TrainValidationSplit should have HasSeed
https://issues.apache.org/jira/browse/SPARK-14181

TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11985 from yinxusen/SPARK-14181.
2016-03-30 14:32:29 -07:00
Yuhao Yang d2a819a636 [SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov test
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14154

I just read the code for KolmogorovSmirnovTest and find it could be much simplified following the original definition.

Send a PR for discussion

## How was this patch tested?
unit test

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11954 from hhbyyh/ksoptimize.
2016-03-29 09:16:50 -07:00
Bryan Cutler 425bcf6d68 [SPARK-13963][ML] Adding binary toggle param to HashingTF
## What changes were proposed in this pull request?
Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality.  This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models.

## How was this patch tested?
Added unit tests for ML and MLlib

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
2016-03-29 12:30:30 +02:00
sethah f6066b0c3c [SPARK-11730][ML] Add feature importances for GBTs.
## What changes were proposed in this pull request?

Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each.

GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).

## How was this patch tested?

Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11961 from sethah/SPARK-11730.
2016-03-28 22:27:53 -07:00
Xusen Yin 8c11d1aab8 [SPARK-11893] Model export/import for spark.ml: TrainValidationSplit
https://issues.apache.org/jira/browse/SPARK-11893

jkbradley In order to share read/write with `TrainValidationSplit`, I move the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tunning package.

To reduce the repeated tests, I move the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and create a fake validator called `MyValidator` to test the shared code.

With `SharedReadWrite`, potential newly added `Validator` can share the read/write common part, and only need to implement their extra params save/load.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9971 from yinxusen/SPARK-11893.
2016-03-28 15:40:06 -07:00
Chenliang Xu c8388297c4 [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrix
## What changes were proposed in this pull request?

Fix incorrect use of binarySearch in SparseMatrix

## How was this patch tested?

Unit test added.

Author: Chenliang Xu <chexu@groupon.com>

Closes #11992 from luckyrandom/SPARK-14187.
2016-03-28 08:33:37 -07:00
Sean Owen 7b84154018 [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode
## What changes were proposed in this pull request?

Better error message with k-means init can't be enough samples from input (because it is perhaps empty)

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11979 from srowen/SPARK-12494.
2016-03-28 12:01:33 +01:00
Joseph K. Bradley 8ef493760f [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel evaluate() public
## What changes were proposed in this pull request?

Made evaluate method public.  Fixed LogisticRegressionModel evaluate to handle case when probabilityCol is not specified.

## How was this patch tested?

There were already unit tests for these methods.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11928 from jkbradley/public-evaluate.
2016-03-27 19:04:18 -07:00
Dongjoon Hyun 0f02a5c6e6 [MINOR][MLLIB] Remove TODO comment DecisionTreeModel.scala
## What changes were proposed in this pull request?

This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) is fixed now. Now, we had better remove the comment without changing persistent code.

```scala
-        categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
+        categories: Seq[Double]) {
```

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11966 from dongjoon-hyun/change_categories_type.
2016-03-27 20:07:31 +01:00
Liwei Lin 62a85eb09f [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5
## What changes were proposed in this pull request?

Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.

## How was this patch tested?

- manully checked that no codes in Spark call these methods any more
- existing test suits

Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11910 from lw-lin/remove-deprecates.
2016-03-26 12:41:34 +00:00
Joseph K. Bradley 54d13bed87 [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula
## What changes were proposed in this pull request?

StringIndexerModel.transform sets the output column metadata to use name inputCol.  It should not.  Fixing this causes a problem with the metadata produced by RFormula.

Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction.

Note that "prefixes" is no longer accurate since internal strings may be replaced.

## How was this patch tested?

Unit test which failed before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11965 from jkbradley/StringIndexer-fix.
2016-03-25 16:00:09 -07:00
Yanbo Liang 13cbb2de70 [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close #11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11932 from yanboliang/spark-13010-new.
2016-03-24 22:29:34 -07:00
Xusen Yin 2cf46d5a96 [SPARK-11871] Add save/load for MLPC
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11871

Add save/load for MLPC

## How was this patch tested?

Test with Scala unit test

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9854 from yinxusen/SPARK-11871.
2016-03-24 15:29:17 -07:00
Ruifeng Zheng 048a7594e2 [SPARK-14030][MLLIB] Add parameter check to MLLIB
## What changes were proposed in this pull request?

add parameter verification to MLLIB, like
numCorrections > 0
tolerance >= 0
iters > 0
regParam >= 0

## How was this patch tested?

manual tests

Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <mllabs@datanode1.(none)>
Author: mllabs <mllabs@datanode1.(none)>
Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11852 from zhengruifeng/lbfgs_check.
2016-03-24 09:25:00 +00:00
Juarez Bochi 1803bf6333 Fix typo in ALS.scala
## What changes were proposed in this pull request?

Just a typo

## How was this patch tested?

N/A

Author: Juarez Bochi <jbochi@gmail.com>

Closes #11896 from jbochi/patch-1.
2016-03-24 09:24:00 +00:00
Joseph K. Bradley cf823bead1 [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
2016-03-23 21:16:00 -07:00
sethah 69bc2c17f1 [SPARK-13952][ML] Add random seed to GBT
## What changes were proposed in this pull request?

`GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily.

## How was this patch tested?

Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11903 from sethah/SPARK-13952.
2016-03-23 15:08:47 -07:00
Joseph K. Bradley 4d955cd694 [SPARK-14035][MLLIB] Make error message more verbose for mllib NaiveBayesSuite
## What changes were proposed in this pull request?

Print more info about failed NaiveBayesSuite tests which have exhibited flakiness.

## How was this patch tested?

Ran locally with incorrect check to cause failure.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11858 from jkbradley/naive-bayes-bug-log.
2016-03-23 10:51:58 +00:00
Xusen Yin d6dc12ef01 [SPARK-13449] Naive Bayes wrapper in SparkR
## What changes were proposed in this pull request?

This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.

I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

I removed the preprocess part that omit NA values because we don't know which columns to process.

## How was this patch tested?

Test against output from R package e1071's naiveBayes.

cc: yanboliang yinxusen

Closes #11486

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #11890 from mengxr/SPARK-13449.
2016-03-22 14:16:51 -07:00
Dongjoon Hyun df61fbd978 [SPARK-13986][CORE][MLLIB] Remove DeveloperApi-annotations for non-publics
## What changes were proposed in this pull request?

Spark uses `DeveloperApi` annotation, but sometimes it seems to conflict with visibility. This PR tries to fix those conflict by removing annotations for non-publics. The following is the example.

**JobResult.scala**
```scala
DeveloperApi
sealed trait JobResult

DeveloperApi
case object JobSucceeded extends JobResult

-DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
```

## How was this patch tested?

Pass the existing Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11797 from dongjoon-hyun/SPARK-13986.
2016-03-21 14:57:52 +00:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
 -167,5 +164,7
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
sethah 811a524722 [SPARK-12182][ML] Distributed binning for trees in spark.ml
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10231 from sethah/SPARK-12182.
2016-03-20 12:31:28 -07:00
Yuhao Yang f43a26ef92 [SPARK-13629][ML] Add binary toggle Param to CountVectorizer
## What changes were proposed in this pull request?

This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013,
containing some comment update and style adjustment.
jkbradley

## How was this patch tested?

unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11830 from hhbyyh/cvToggle.
2016-03-18 17:34:33 -07:00
Yanbo Liang 7783b6f38f [MINOR][ML] When trainingSummary is None, it should throw RuntimeException.
## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11784 from yanboliang/fix-summary.
2016-03-18 11:23:17 +00:00
sethah 1614485fd9 [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.

Say there are 3 categories A, B, C. We consider 3 splits:

* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).

This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9474 from sethah/SPARK-10788.
2016-03-17 16:44:41 -07:00
Joseph K. Bradley b39e80d39d [SPARK-13761][ML] Remove remaining uses of validateParams
## What changes were proposed in this pull request?

Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema

## How was this patch tested?

Existing unit tests, modified to check using transformSchema instead of validateParams

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11790 from jkbradley/SPARK-13761-cleanup.
2016-03-17 13:23:07 -07:00
Xusen Yin edf8b8775b [SPARK-11891] Model export/import for RFormula and RFormulaModel
https://issues.apache.org/jira/browse/SPARK-11891

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9884 from yinxusen/SPARK-11891.
2016-03-17 10:19:10 -07:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Yuhao Yang 357d82d84d [SPARK-13629][ML] Add binary toggle Param to CountVectorizer
## What changes were proposed in this pull request?

It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
If set, then all non-zero counts will be set to 1.

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11536 from hhbyyh/cvToggle.
2016-03-17 11:21:11 +02:00
Yuhao Yang 92b70576ea [SPARK-13761][ML] Deprecate validateParams
## What changes were proposed in this pull request?

Deprecate validateParams() method here: 035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)
Move all functionality in overridden methods to transformSchema().
Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11620 from hhbyyh/depreValid.
2016-03-16 17:31:55 -07:00
Jakob Odersky d4d84936fb [SPARK-11011][SQL] Narrow type of UDT serialization
## What changes were proposed in this pull request?

Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.

## How was this patch tested?

Existing tests were successfully run on local machine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #11379 from jodersky/SPARK-11011-udt-types.
2016-03-16 16:59:36 -07:00
Xiangrui Meng 85c42fda99 [SPARK-13927][MLLIB] add row/column iterator to local matrices
## What changes were proposed in this pull request?

Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.

## How was this patch tested?

Unit tests on sparse and dense matrix.

cc: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #11757 from mengxr/SPARK-13927.
2016-03-16 14:19:54 -07:00
Joseph K. Bradley 6fc2b6541f [SPARK-11888][ML] Decision tree persistence in spark.ml
### What changes were proposed in this pull request?

Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
* The shared implementation is in treeModels.scala
* I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.

Other changes:
* Made CategoricalSplit.numCategories public (to use in persistence)
* Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.

### How was this patch tested?

Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11581 from jkbradley/dt-io.
2016-03-16 14:18:35 -07:00