Commit graph

1722 commits

Author SHA1 Message Date
wm624@hotmail.com 036b50347c [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model
## What changes were proposed in this pull request?

While adding DistributedLDAModel training summary for SparkR, I found that the logPrior for original and loaded model is different.
For example, in the test("read/write DistributedLDAModel"), I add the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.

Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16491 from wangmiao1981/ldabug.
2017-01-07 11:07:49 -08:00
Wenchen Fan b3d39620c5 [SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter
## What changes were proposed in this pull request?

`OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:
1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write` and remove `def write(row: Row)` and it's related converter code. All implementations should just implement `def write(row: InternalRow)`

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16479 from cloud-fan/hive-writer.
2017-01-08 00:42:09 +08:00
sueann d60f6f62d0 [SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes #16480 from sueann/SPARK-18194.
2017-01-06 18:53:16 -08:00
Yanbo Liang dfc4c935ba [MINOR] Correct LogisticRegression test case for probability2prediction.
## What changes were proposed in this pull request?
Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```.

## How was this patch tested?
Change unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16477 from yanboliang/lor-pred.
2017-01-05 18:59:49 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Zheng RuiFeng 7a82505817
[SPARK-19054][ML] Eliminate extra pass in NB
## What changes were proposed in this pull request?
eliminate unnecessary extra pass in NB's train

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16453 from zhengruifeng/nb_getNC.
2017-01-04 11:54:13 +00:00
Weiqing Yang e5c307c50a
[MINOR] Add missing sc.stop() to end of examples
## What changes were proposed in this pull request?

Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.

## How was this patch tested?
Pass the build and unit tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16426 from weiqingy/testIssue.
2017-01-03 09:56:42 +00:00
Sean Owen 56d3a7eb83
[SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient
## What changes were proposed in this pull request?

mllib.KMeansModel.clusterCentersWithNorm is a method than ends up being called every time `predict` is called on a single vector, which is bad news for now the ml.KMeansModel Transformer works, which necessarily transforms one vector at a time.

This causes the model to just store the vectors with norms upfront. The extra norm should be small compared to the vectors. This would avoid this form of overhead on this and other code paths.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #16328 from srowen/SPARK-18808.
2016-12-30 10:40:17 +00:00
Ilya Matiach 87bc4112c5 [SPARK-18698][ML] Adding public constructor that takes uid for IndexToString
## What changes were proposed in this pull request?

Based on SPARK-18698, this adds a public constructor that takes a UID for IndexToString.  Other transforms have similar constructors.

## How was this patch tested?

A unit test was added to verify the new functionality.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16436 from imatiach-msft/ilmat/fix-indextostring.
2016-12-29 13:25:49 -08:00
sethah 6a475ae466 [SPARK-17772][ML][TEST] Add test functions for ML sample weights
## What changes were proposed in this pull request?

More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, there seems to be a few tests that have been implemented commonly:

* Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
* Check that outliers with tiny sample weights do not affect the algorithm's performance

This patch adds an additional test:

* Check that algorithms are invariant to constant scaling of the sample weights. i.e. uniform sample weights with `w_i = 1.0` is effectively the same as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`

The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.

## How was this patch tested?

This patch only involves modifying test suites.

## Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15721 from sethah/SPARK-17772.
2016-12-28 07:01:14 -08:00
Yanbo Liang 9cff67f346 [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction.
## What changes were proposed in this pull request?
Correct test cases of ```LogisticRegression``` raw2prediction & probability2prediction.

## How was this patch tested?
Changed unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16407 from yanboliang/raw-probability.
2016-12-28 01:24:18 -08:00
Peng 79ff853631 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

ut will be added soon

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes #15212 from mpjlu/fdr_fwe.
2016-12-28 00:49:36 -08:00
Ryan Williams afd9bc1d8a [SPARK-17807][CORE] split test-tags into test-JAR
Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.
2016-12-21 16:37:20 -08:00
Zakaria_Hili 7db09abb01
[SPARK-18356][ML] KMeans should cache RDD before training
## What changes were proposed in this pull request?

According to request of Mr. Joseph Bradley , I did this update of my PR https://github.com/apache/spark/pull/15965 in order to eliminate the extrat fit() method.

jkbradley
## How was this patch tested?
Pass existing tests

Author: Zakaria_Hili <zakahili@gmail.com>
Author: HILI Zakaria <zakahili@gmail.com>

Closes #16295 from ZakariaHili/zakbranch.
2016-12-19 10:30:38 +00:00
Anthony Truchet 9e8a9d7c6a
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as a closure in a
treeAggregate call. To avoid that, we replace treeAggregate by
mapPartition + treeReduce, creating a zero vector inside the mapPartition
block in-place.

## How was this patch tested?

Unit test for module mllib run locally for correctness.

As for performance we run an heavy optimization on our production data (50 iterations on 128 MB weight vectors) and have seen significant decrease in terms both of runtime and container being killed by lack of off-heap memory.

Author: Anthony Truchet <a.truchet@criteo.com>
Author: sethah <seth.hendrickson16@gmail.com>
Author: Anthony Truchet <AnthonyTruchet@users.noreply.github.com>

Closes #16037 from AnthonyTruchet/ENG-17719-lbfgs-only.
2016-12-13 21:30:57 +00:00
actuaryzhang e57e3938c6
[SPARK-18715][ML] Fix AIC calculations in Binomial GLM
The AIC calculation in Binomial GLM seems to be off when the response or weight is non-integer: the result is different from that in R. This issue arises when one models rates, i.e, num of successes normalized over num of trials, and uses num of trials as weights. In this case, the effective likelihood is  weight * label ~ binomial(weight, mu), where weight = number of trials, and weight * label = number of successes and mu = is the success rate.

srowen sethah yanboliang HyukjinKwon zhengruifeng

## What changes were proposed in this pull request?
I suggest changing the current aic calculation for the Binomial family from
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
      }.sum()
```
to the following which generalizes to the case of real-valued response and weights.
```
      -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
        val wt = math.round(weight).toInt
        if (wt == 0){
          0.0
        } else {
          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
        }
      }.sum()
```
## How was this patch tested?
I will write the unit test once the community wants to include the proposed change. For now, the following modifies existing tests in weighted Binomial GLM to illustrate the issue. The second label is changed from 0 to 0.5.

```
val datasetWithWeight = Seq(
    (1.0, 1.0, 0.0, 5.0),
    (0.5, 2.0, 1.0, 2.0),
    (1.0, 3.0, 2.0, 1.0),
    (0.0, 4.0, 3.0, 3.0)
  ).toDF("y", "w", "x1", "x2")

val formula = (new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label"))
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight).select("features", "label", "w")

val glr = new GeneralizedLinearRegression()
    .setFamily("binomial")
    .setWeightCol("w")
    .setFitIntercept(false)
    .setRegParam(0)

val model = glr.fit(output)
model.summary.aic
```
The AIC from Spark is 17.3227, and the AIC from R is 15.66454.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16149 from actuaryzhang/aic.
2016-12-13 21:27:29 +00:00
krishnakalyan3 c802ad8718
[SPARK-18628][ML] Update Scala param and Python param to have quotes
## What changes were proposed in this pull request?

Updated Scala param and Python param to have quotes around the options making it easier for users to read.

## How was this patch tested?

Manually checked the docstrings

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16242 from krishnakalyan3/doc-string.
2016-12-11 09:28:16 +00:00
Michal Senkyr 114324832a
[SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building with Java 8
## What changes were proposed in this pull request?

The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc.

Replace `>` with literals in Javadoc to allow the build to pass.

## How was this patch tested?

Documentation was built and inspected manually to ensure it still displays correctly in the browser

```
cd docs && jekyll serve
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16201 from michalsenkyr/javadoc8-gt-fix.
2016-12-10 19:54:07 +00:00
Yanbo Liang 97255497d8 [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16169 from yanboliang/spark-18326.
2016-12-07 20:23:28 -08:00
actuaryzhang b828027139
[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization
Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits.

## What changes were proposed in this pull request?
Update initialization in Poisson GLM

## How was this patch tested?
Add test in GeneralizedLinearRegressionSuite

srowen sethah yanboliang HyukjinKwon mengxr

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16131 from actuaryzhang/master.
2016-12-07 16:37:25 +08:00
Yanbo Liang 90b59d1bf2 [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.
## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16117 from yanboliang/spark-18686.
2016-12-07 00:31:11 -08:00
Yuhao fac5b75b74
[SPARK-18374][ML] Incorrect words in StopWords/english.txt
## What changes were proposed in this pull request?

Currently English stop words list in MLlib contains only the argumented words after removing all the apostrophes, so "wouldn't" become "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

Adding original form to stop words list to match the behavior of Tokenizer and StopwordsRemover. Also remove "won" from list.

see more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?
existing ut

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.
2016-12-07 05:12:24 +08:00
Zheng RuiFeng bdfe7f6746 [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol
## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`

## How was this patch tested?
added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16059 from zhengruifeng/ovrm_setCol.
2016-12-05 00:32:58 -08:00
Reynold Xin c7c7265950 [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16126 from rxin/SPARK-18695.
2016-12-02 21:09:37 -08:00
Yanbo Liang a985dd8e99 [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial."
## What changes were proposed in this pull request?
It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved.

## How was this patch tested?
Existing unit tests.

This reverts commit daa975f4bf.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16118 from yanboliang/spark-18291-revert.
2016-12-02 12:16:57 -08:00
Nathan Howell c82f16c15e [SPARK-18658][SQL] Write text records directly to a FileOutputStream
## What changes were proposed in this pull request?

This replaces uses of `TextOutputFormat` with an `OutputStream`, which will either write directly to the filesystem or indirectly via a compressor (if so configured). This avoids intermediate buffering.

The inverse of this (reading directly from a stream) is necessary for streaming large JSON records (when `wholeFile` is enabled) so I wanted to keep the read and write paths symmetric.

## How was this patch tested?

Existing unit tests.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #16089 from NathanHowell/SPARK-18658.
2016-12-01 21:40:49 -08:00
wm624@hotmail.com 2eb6764fbb [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should should support output original label.
## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15910 from wangmiao1981/audit.
2016-11-30 20:32:17 -08:00
Yanbo Liang 60022bfd65 [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16009 from yanboliang/spark-18318.
2016-11-30 13:21:05 -08:00
Anthony Truchet c5a64d7606
[SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun
## What changes were proposed in this pull request?

Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS.

## How was this patch tested?

UTests + check that fixed fatal memory consumption on Criteo's use cases.

This contribution is made on behalf of Criteo S.A.
(http://labs.criteo.com/) under the terms of the Apache v2 License.

Author: Anthony Truchet <a.truchet@criteo.com>

Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
2016-11-30 10:04:47 +00:00
Yuhao 9b670bcaec [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
## What changes were proposed in this pull request?
make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.

Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319

## How was this patch tested?
existing ut

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15972 from hhbyyh/experimental21.
2016-11-29 18:46:59 -08:00
Yanbo Liang 95f7985012 [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to subclasses
## What changes were proposed in this pull request?
Mainly two changes:
* Move DT/RF/GBT Param setter methods to subclasses.
* Deprecate corresponding setter methods in the model classes.

See discussion here https://github.com/apache/spark/pull/15913#discussion_r89662469.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16017 from yanboliang/spark-18592.
2016-11-29 11:19:35 -08:00
hyukjinkwon 1a870090e4
[SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for backticks
## What changes were proposed in this pull request?

Currently, single line comment does not mark down backticks to `<code>..</code>` but prints as they are (`` `..` ``). For example, the line below:

```scala
/** Return an RDD with the pairs from `this` whose keys are not in `other`. */
```

So, we could work around this as below:

```scala
/**
 * Return an RDD with the pairs from `this` whose keys are not in `other`.
 */
```

- javadoc

  - **Before**
    ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png)

  - **After**
    ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png)

- scaladoc (this one looks fine either way)

  - **Before**
    ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png)

  - **After**
    ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png)

I suspect this is related with SPARK-16153 and genjavadoc issue in ` typesafehub/genjavadoc#85`.

## How was this patch tested?

I found them via

```
grep -r "\/\*\*.*\`" . | grep .scala
````

and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16050 from HyukjinKwon/javadoc-markdown.
2016-11-29 13:50:24 +00:00
hyukjinkwon f830bb9170
[SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
## What changes were proposed in this pull request?

This PR make `sbt unidoc` complete with Java 8.

This PR roughly includes several fixes as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```diff
  - * A column that will be computed based on the data in a [[DataFrame]].
  + * A column that will be computed based on the data in a `DataFrame`.
  ```

- Fix throws annotations so that they are recognisable in javadoc

- Fix URL links to `<a href="http..."></a>`.

  ```diff
  - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
  + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
  + * Decision tree (Wikipedia)</a> model for regression.
  ```

  ```diff
  -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
  +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
  +   * Receiver operating characteristic (Wikipedia)</a>
  ```

- Fix < to > to

  - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.

  - Wrap it with `{{{...}}}` to print them in javadoc or use `{code ...}` or `{literal ..}`. Please refer https://github.com/apache/spark/pull/16013#discussion_r89665558

- Fix `</p>` complaint

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
2016-11-29 09:41:32 +00:00
Yun Ni 05f7c6ffab [SPARK-18408][ML] API Improvements for LSH
## What changes were proposed in this pull request?

(1) Change output schema to `Array of Vector` instead of `Vectors`
(2) Use `numHashTables` as the dimension of Array
(3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH`
(4) Make `randUnitVectors/randCoefficients` private
(5) Make Multi-Probe NN Search and `hashDistance` private for future discussion

Saved for future PRs:
(1) AND-amplification and `numHashFunctions` as the dimension of Vector are saved for a future PR.
(2) `hashDistance` and MultiProbe NN Search needs more discussion. The current implementation is just a backward compatible one.

## How was this patch tested?
Related unit tests are modified to make sure the performance of LSH are ensured, and the outputs of the APIs meets expectation.

Author: Yun Ni <yunn@uber.com>
Author: Yunni <Euler57721@gmail.com>

Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
2016-11-28 15:14:46 -08:00
Yanbo Liang c4a7eef0ce [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15913 from yanboliang/spark-18481.
2016-11-26 05:28:41 -08:00
Zakaria_Hili 445d4d9e13
[SPARK-18356][ML] Improve MLKmeans Performance
## What changes were proposed in this pull request?

Spark Kmeans fit() doesn't cache the RDD which generates a lot of warnings :
 WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
So, Kmeans should cache the internal rdd before calling the Mllib.Kmeans algo, this helped to improve spark kmeans performance by 14%

a9cf905cf7

hhbyyh
## How was this patch tested?
Pass Kmeans tests and existing tests

Author: Zakaria_Hili <zakahili@gmail.com>
Author: HILI Zakaria <zakahili@gmail.com>

Closes #15965 from ZakariaHili/zakbranch.
2016-11-25 13:19:26 +00:00
hyukjinkwon 51b1c1551d
[SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.

This PR roughly fixes several things as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
  [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
  ```

- Fix an exception annotation and remove code backticks in `throws` annotation

  Currently, sbt unidoc with Java 8 complains as below:

  ```
  [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
  [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
  ```

  `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).

- Fix `[[http..]]` to `<a href="http..."></a>`.

  ```diff
  -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
  -   * blog page]].
  +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
  +   * Oracle blog page</a>.
  ```

   `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.

- It seems class can't have `return` annotation. So, two cases of this were removed.

  ```
  [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
  [error]    * return New instance of IsotonicRegression.
  ```

- Fix < to `&lt;` and > to `&gt;` according to HTML rules.

- Fix `</p>` complaint

- Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15999 from HyukjinKwon/SPARK-3359-errors.
2016-11-25 11:27:07 +00:00
Zheng RuiFeng 2dfabec38c [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel
## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for BiKModel and GMModel
add `setProbabilityCol` for GMModel
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15957 from zhengruifeng/bikm_set.
2016-11-24 05:46:05 -08:00
Yanbo Liang 982b82e32e [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15930 from yanboliang/spark-18501.
2016-11-22 19:17:48 -08:00
sethah e811fbf9ed [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM
## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.

## How was this patch tested?

Unit tests.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15777 from sethah/pyspark_cluster_summaries.
2016-11-21 05:36:49 -08:00
sethah 856e004200
[SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training
## What changes were proposed in this pull request?

This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.

Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.

## How was this patch tested?

This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15893 from sethah/logreg_refactor.
2016-11-20 01:42:37 +00:00
hyukjinkwon d5b1d5fc80
[SPARK-18445][BUILD][DOCS] Fix the markdown for Note:/NOTE:/Note that/'''Note:''' across Scala/Java API documentation
## What changes were proposed in this pull request?

It seems in Scala/Java,

- `Note:`
- `NOTE:`
- `Note that`
- `'''Note:'''`
- `note`

This PR proposes to fix those to `note` to be consistent.

**Before**

- Scala
  ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)

- Java
  ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)

**After**

- Scala
  ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)

- Java
  ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)

## How was this patch tested?

The notes were found via

```bash
grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note that " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

And then fixed one by one comparing with API documentation/access modifiers.

After that, manually tested via `jekyll build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15889 from HyukjinKwon/SPARK-18437.
2016-11-19 11:24:15 +00:00
Zheng RuiFeng cdaf4ce9fe
[SPARK-18480][DOCS] Fix wrong links for ML guide docs
## What changes were proposed in this pull request?
1, There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`, the first one had no effert.
2, `DataFrame`, `Transformer`, `Pipeline` and `Parameter`  in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
3, `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessable, because class `PythonMLLibAPI` is private.
4, Other link updates.
## How was this patch tested?
 manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15912 from zhengruifeng/md_fix.
2016-11-17 13:40:16 +00:00
VinceShieh de77c67750
[SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings
## What changes were proposed in this pull request?

Several places in MLlib use custom regexes or other approaches to parse Spark versions.
Those should be fixed to use the VersionUtils. This PR replaces custom regexes with
VersionUtils to get Spark version numbers.
## How was this patch tested?

Existing tests.

Signed-off-by: VinceShieh vincent.xieintel.com

Author: VinceShieh <vincent.xie@intel.com>

Closes #15055 from VinceShieh/SPARK-17462.
2016-11-17 13:37:42 +00:00
Zheng RuiFeng c68f1a38af [SPARK-18434][ML] Add missing ParamValidations for ML algos
## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15881 from zhengruifeng/arg_checking.
2016-11-16 02:46:27 -08:00
Yanbo Liang 95eb06bd7d [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.
## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for ```spark.mlp```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15883 from yanboliang/spark-18438.
2016-11-16 01:04:18 -08:00
actuaryzhang ae6cddb787
[SPARK-18166][MLLIB] Fix Poisson GLM bug due to wrong requirement of response values
## What changes were proposed in this pull request?

The current implementation of Poisson GLM seems to allow only positive values. This is incorrect since the support of Poisson includes the origin. The bug is easily fixed by changing the test of the Poisson variable from  'require(y **>** 0.0' to  'require(y **>=** 0.0'.

mengxr  srowen

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: actuaryzhang <actuaryzhang@uber.com>

Closes #15683 from actuaryzhang/master.
2016-11-14 12:08:06 +01:00
Yanbo Liang 07be232ea1 [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data
## What changes were proposed in this pull request?
* Fix the following exceptions which throws when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) were fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.

## How was this patch tested?
Add unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15851 from yanboliang/spark-18412.
2016-11-13 20:25:12 -08:00
Yanbo Liang 22cb3a060a [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes
## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15826 from yanboliang/spark-14077-2.
2016-11-12 06:13:22 -08:00
sethah 46b2550bcd
[SPARK-18060][ML] Avoid unnecessary computation for MLOR
## What changes were proposed in this pull request?

Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model.

## How was this patch tested?

This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups.
## Performance Tests

**Setup**

3 node bare-metal cluster
120 cores total
384 gb RAM total

**Results**

NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN.

|    |   numPoints |   numFeatures |   numClasses |   regParam |   elasticNetParam |   currentMasterTime (sec) |   thisPatchTime (sec) |   pctSpeedup |
|----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------|
|  0 |       1e+07 |           100 |          500 |       0.5  |                 0 |                        90 |                    18 |           80 |
|  1 |       1e+08 |           100 |           50 |       0.5  |                 0 |                        90 |                    19 |           78 |
|  2 |       1e+08 |           100 |           50 |       0.05 |                 1 |                        72 |                    19 |           73 |
|  3 |       1e+06 |           100 |         5000 |       0.5  |                 0 |                        93 |                    53 |           43 |
|  4 |       1e+07 |           100 |         5000 |       0.5  |                 0 |                       900 |                   390 |           56 |
|  5 |       1e+08 |           100 |          500 |       0.5  |                 0 |                       840 |                   174 |           79 |
|  6 |       1e+08 |           100 |          200 |       0.5  |                 0 |                       360 |                    72 |           80 |
|  7 |       1e+08 |          1000 |            5 |       0.5  |                 0 |                         9 |                     3 |           66 |

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
2016-11-12 01:38:26 +00:00