Commit graph

1830 commits

Author SHA1 Message Date
Yanbo Liang b8733e0ad9 [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove ML methods we deprecated in 2.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17867 from yanboliang/spark-20606.
2017-05-09 17:30:37 +08:00
Jon McLean be53a78352 [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException
## What changes were proposed in this pull request?

Added a check for the number of defined values.  Previously, the argmax function assumed that at least one value was defined if the vector size was greater than zero.

## How was this patch tested?

Tests were added to the existing VectorsSuite to cover this case.

Author: Jon McLean <jon.mclean@atsid.com>

Closes #17877 from jonmclean/vectorArgmaxIndexBug.
2017-05-09 09:47:50 +01:00
Nick Pentreath 10b00abadf [SPARK-20587][ML] Improve performance of ML ALS recommendForAll
This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17845 from MLnick/ml-als-perf.
2017-05-09 10:13:15 +02:00
Peng 8079424763 [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll
The recommendForAll of MLlib ALS is very slow.
GC is a key problem of the current method.
Each task uses the following code to keep the temporary result:
val output = new Array[(Int, (Int, Double))](m*n)
m = n = 4096 (the default value; there is no way to change it)
so output is about 4k * 4k * (4 + 4 + 8) = 256 MB. This allocates a large amount of memory, causes serious GC problems, and frequently leads to OOM.

Actually, we don't need to save all the temporary results. Suppose we recommend topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) bytes to save the temporary result.
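As a minimal sketch (not the actual Spark implementation) of the per-user top-K idea: `itemIds` and `scores` below are hypothetical inputs holding one block's candidate items and their predicted ratings for a single user.

```scala
import scala.collection.mutable

// Keep at most `k` (item, score) pairs per user, so temporary memory per user
// is O(k) instead of O(n).
def topKForUser(itemIds: Array[Int], scores: Array[Double], k: Int): Seq[(Int, Double)] = {
  // Min-heap on score: the head is always the weakest of the current top-k candidates.
  val queue = mutable.PriorityQueue.empty[(Int, Double)](Ordering.by[(Int, Double), Double](t => -t._2))
  var j = 0
  while (j < itemIds.length) {
    if (queue.size < k) {
      queue.enqueue((itemIds(j), scores(j)))
    } else if (scores(j) > queue.head._2) {
      queue.dequeue()
      queue.enqueue((itemIds(j), scores(j)))
    }
    j += 1
  }
  queue.toSeq.sortWith(_._2 > _._2)  // highest scores first
}
```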

The Test Environment:
3 workers: each with 10 cores, 30 GB of memory, and 1 executor.
The Data: 480,000 users and 17,000 items.

| BlockSize     | 1024 | 2048 | 4096 | 8192 |
|---------------|------|------|------|------|
| Old method    | 245s | 332s | 488s | OOM  |
| This solution | 121s | 118s | 117s | 120s |

Existing unit tests.

Author: Peng <peng.meng@intel.com>
Author: Peng Meng <peng.meng@intel.com>

Closes #17742 from mpjlu/OptimizeAls.
2017-05-09 10:06:48 +02:00
Nick Pentreath 58518d0707 [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.

## How was this patch tested?

Updated existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17860 from MLnick/SPARK-20596-als-rec-tests.
2017-05-08 12:45:00 +02:00
Daniel Li 88e6d75072 [SPARK-20484][MLLIB] Add documentation to ALS code
## What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

## How was this patch tested?

Existing tests were used.

mengxr srowen

This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.

Author: Daniel Li <dan@danielyli.com>

Closes #17793 from danielyli/spark-20484.
2017-05-07 10:09:58 +01:00
Wayne Zhang 0d16faab90 [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
## What changes were proposed in this pull request?
Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it can get very tedious to manually cast them to Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types.
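
A minimal usage sketch under this change (column names and splits are hypothetical; assumes an existing `SparkSession` named `spark`):

```scala
import org.apache.spark.ml.feature.Bucketizer

// "age" is a LongType column; after this change it no longer has to be cast to Double first.
val df = spark.range(0, 10).toDF("age")
val bucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setSplits(Array(Double.NegativeInfinity, 3.0, 7.0, Double.PositiveInfinity))
bucketizer.transform(df).show()
```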

## How was this patch tested?
New test.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17840 from actuaryzhang/bucketizer.
2017-05-05 10:23:58 +08:00
Yanbo Liang c5dceb8c65 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
## What changes were proposed in this pull request?
Address some minor comments for #17715:
* Put bound-constrained optimization params under expertParams.
* Update some docs.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17829 from yanboliang/spark-20047-followup.
2017-05-04 17:56:43 +08:00
Yan Facai (颜发才) 7f96f2d7f2 [SPARK-16957][MLLIB] Use midpoints for split values.
## What changes were proposed in this pull request?

Use midpoints for split values for now; they may be made weighted later.

## How was this patch tested?

+ [x] add unit test.
+ [x] revise Split's unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
2017-05-03 10:54:40 +01:00
Sean Owen 16fab6b0ef [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release
## What changes were proposed in this pull request?

Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17803 from srowen/SPARK-20523.
2017-05-03 10:18:35 +01:00
wangmiao1981 ee694cdff6 [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and value should be lazy
## What changes were proposed in this pull request?

The MultilayerPerceptronClassifierWrapper model should be private.
In LogisticRegressionWrapper.scala, rFeatures and rCoefficients should be lazy.

## How was this patch tested?

Unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17808 from wangmiao1981/lazy.
2017-04-29 10:58:48 -07:00
Yuhao Yang add9d1bba5 [SPARK-19791][ML] Add doc and example for fpgrowth
## What changes were proposed in this pull request?

Add a new section for fpm.
Add examples for FPGrowth in Scala and Java.

Updated: rewrote transform to be more compact.

## How was this patch tested?

local doc generation.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17130 from hhbyyh/fpmdoc.
2017-04-29 10:51:45 -07:00
Yanbo Liang 606432a13a
[SPARK-20047][ML] Constrained Logistic Regression
## What changes were proposed in this pull request?
MLlib ```LogisticRegression``` should support bound-constrained optimization (only for L2 regularization). Users can add bound constraints on the coefficients to make the solver produce a solution in the specified range.

Under the hood, we call Breeze [```L-BFGS-B```](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGSB.scala) as the solver for bound-constrained optimization. However, the current Breeze implementation of L-BFGS-B has some bugs, which https://github.com/scalanlp/breeze/pull/633 fixed. We need to upgrade the Breeze dependency later; for now this PR uses a workaround L-BFGS-B temporarily for review.
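
A minimal usage sketch of the new bound-constrained params (the DataFrame `training` with a 2-dimensional `features` column is hypothetical):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

val lr = new LogisticRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.0)  // bound constraints are only supported with L2 regularization
  // Constrain the binomial coefficients (and the intercept) to be non-negative.
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 2, Array(0.0, 0.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(0.0))
val model = lr.fit(training)
```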

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17715 from yanboliang/spark-20047.
2017-04-27 20:48:43 +00:00
ding 0a7f5f2798 [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.

This PR makes Pregel checkpoint the graph periodically if the checkpoint directory is set.
It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
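
A minimal sketch of enabling this from user code (assumes an existing SparkContext `sc` and a GraphX `graph`; the path and the PageRank call are only illustrative):

```scala
// Once a checkpoint directory is set, Pregel-based algorithms can checkpoint the
// graph periodically and keep the lineage chain short.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical path
val ranks = graph.pageRank(0.0001)              // long-running iterative Pregel job
```
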
## How was this patch tested?

unit tests, manual tests

Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>

Closes #15125 from dding3/cp2_pregel.
2017-04-25 11:20:32 -07:00
Yanbo Liang 67eef47acf
[SPARK-20449][ML] Upgrade breeze version to 0.13.1
## What changes were proposed in this pull request?
Upgrade the Breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.
2017-04-25 17:10:41 +00:00
wangmiao1981 387565cf14 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
## What changes were proposed in this pull request?

This is a follow-up PR of #17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17754 from wangmiao1981/followup.
2017-04-25 16:30:36 +08:00
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
wm624@hotmail.com 90264aced7 [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
## What changes were proposed in this pull request?

In MultivariateOnlineSummarizer, `add` and `merge` already check weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR.

## How was this patch tested?

Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #17478 from wangmiao1981/logit.
2017-04-24 23:43:06 +08:00
WeichenXu eb00378f0e
[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to center the coefficients to get an identical result.
But the current implementation centers the `coefficientMatrix` using the global mean of all coefficients.

In fact, the `coefficientMatrix` should be centered on each feature index separately.
According to the MLOR probability distribution function, it can easily be proven that
if `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K-1) + c }` is an equivalent solution,
where `c` is an arbitrary vector of `numFeatures` dimensions.
Reference: https://core.ac.uk/download/pdf/6287975.pdf

So we need to center the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always generates coefficients whose sum over each feature dimension is zero.**
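
A minimal sketch of the per-feature centering described above (plain Scala arrays, not the actual Spark code):

```scala
// coefficients(k)(j) = coefficient of feature j for class k. Subtract the mean over
// classes from each feature column, so every column of the result sums to zero.
def centerPerFeature(coefficients: Array[Array[Double]]): Array[Array[Double]] = {
  val numClasses = coefficients.length
  val numFeatures = coefficients.head.length
  val centered = coefficients.map(_.clone())
  var j = 0
  while (j < numFeatures) {
    val mean = coefficients.map(_(j)).sum / numClasses
    var k = 0
    while (k < numClasses) {
      centered(k)(j) -= mean
      k += 1
    }
    j += 1
  }
  centered
}
```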

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17706 from WeichenXu123/mlor_center.
2017-04-21 17:58:13 +00:00
Syrux 095d1cb3aa [SPARK-20265][MLLIB] Improve PrefixSpan pre-processing efficiency
## What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

## How was this patch tested?

Using MLlib's existing PrefixSpan tests and tests of my own on the 8 datasets shown in the graph. All
results obtained were strictly the same as with the original implementation (without this change).
dev/run-tests was also run; no errors were found.

Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>

Author: Syrux <pokcyril@hotmail.com>

Closes #17575 from Syrux/SPARK-20265.
2017-04-13 09:44:33 +01:00
hyukjinkwon ceaf77ae43 [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins
## What changes were proposed in this pull request?

This PR proposes to run Spark unidoc to test the Javadoc 8 build, as Javadoc 8 is easily re-broken.

There are several problems with it:

- It adds a little extra time to the test run. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How this was measured is described in "How was this patch tested?".

- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.

  (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))

To make this automated build pass, this PR also fixes existing Javadoc breaks, including ones introduced by test code, as described above.

These fixes are similar to instances previously fixed. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013

Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.

## How was this patch tested?

Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.

This was tested via manually adding `time.time()` as below:

```diff
     profiles_and_goals = build_profiles + sbt_goals

     print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
           " ".join(profiles_and_goals))

+    import time
+    st = time.time()
     exec_sbt(profiles_and_goals)
+    print("Elapsed :[%s]" % str(time.time() - st))
```

produces

```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17477 from HyukjinKwon/SPARK-18692.
2017-04-12 12:38:48 +01:00
Benjamin Fradet 0d2b796427 [SPARK-20097][ML] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR
## What changes were proposed in this pull request?

- made `numInstances` public in GLR
- made `degreesOfFreedom` public in LR

## How was this patch tested?

reran the concerned test suites

Author: Benjamin Fradet <benjamin.fradet@gmail.com>

Closes #17431 from BenFradet/SPARK-20097.
2017-04-11 09:12:49 +02:00
Sean Owen a26e3ed5e4 [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems
## What changes were proposed in this pull request?

Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").

The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
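
For illustration, the classic symptom and the fix:

```scala
import java.util.Locale

// With a Turkish default locale, "I".toLowerCase() yields the dotless "ı", breaking
// comparisons against ASCII identifiers. Passing Locale.ROOT makes the result locale-independent.
"I".toLowerCase              // "ı" under a Turkish default locale
"I".toLowerCase(Locale.ROOT) // always "i"
```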

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #17527 from srowen/SPARK-20156.
2017-04-10 20:11:56 +01:00
Vijay Ramesh 261eaf5149 [SPARK-20260][MLLIB] String interpolation required for error message
## What changes were proposed in this pull request?
This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:

```
Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
```
(note the literal `$current` instead of the interpolated value)
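
For illustration, the difference the `s` prefix makes:

```scala
val current = 3
val previous = 5
println("found current=$current, previous=$previous")   // prints the placeholders literally
println(s"found current=$current, previous=$previous")  // prints "found current=3, previous=5"
```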

Author: Vijay Ramesh <vramesh@demandbase.com>

Closes #17572 from vijaykramesh/master.
2017-04-09 19:39:09 +01:00
Liang-Chi Hsieh 1a52a62377 [SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation
## What changes were proposed in this pull request?

The DataFrame-based support for the correlation statistics was added in #17108. This patch adds the Python interface for it.

## How was this patch tested?

Python unit test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17494 from viirya/correlation-python-api.
2017-04-07 11:00:10 +02:00
Bryan Cutler e156b5dd39 [SPARK-19953][ML] Random Forest Models use parent UID when being fit
## What changes were proposed in this pull request?

The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator's parent UID when being fit.  This change fixes that so the models can properly be identified with their parents.

## How was this patch tested?
Existing tests.

Added a check to verify that the model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it is called by one test for each ML algorithm.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
2017-04-06 09:41:32 +02:00
Yuhao Yang b28bbffbad [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some testing and found this issue: ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
Currently, associationRules in FPGrowthModel is a lazy val, so `setMinConfidence` has no effect once associationRules has been computed.

I cache the associationRules to avoid re-computation if `minConfidence` is not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there are any concerns.
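
A usage sketch of the behavior after this change (assumes a fitted `FPGrowthModel` named `model` and a DataFrame `df` of transactions):

```scala
// With this change, updating minConfidence on the fitted model affects both the
// generated association rules and transform.
model.setMinConfidence(0.8)
val rules = model.associationRules    // rules at confidence >= 0.8
val predictions = model.transform(df)
```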

## How was this patch tested?

New unit test; I also strengthened the unit tests for model save/load to verify the caching mechanism.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17336 from hhbyyh/fpmodelminconf.
2017-04-04 17:51:45 -07:00
Seth Hendrickson a59759e6c0 [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights
## What changes were proposed in this pull request?

This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.

Note: The primary author when this is committed should be sethah since this is taken from his code.

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17501 from jkbradley/SPARK-20183.
2017-04-04 17:04:41 -07:00
zero323 b34f7665dd [SPARK-19825][R][ML] spark.ml R API for FPGrowth
## What changes were proposed in this pull request?

Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):

- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.

## How was this patch tested?

Feature specific unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17170 from zero323/SPARK-19825.
2017-04-03 23:42:04 -07:00
Yuhao Yang 4d28e8430d [SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316

## How was this patch tested?

local doc generation and example execution

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17324 from hhbyyh/imputerdoc.
2017-04-03 11:42:33 +02:00
Bryan Cutler 2a903a1eec [SPARK-19985][ML] Fixed copy method for some ML Models
## What changes were proposed in this pull request?
Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator.  This change fixes these by creating a new instance of the model and explicitly setting values and parent.

## How was this patch tested?
Added `MLTestingUtils.checkCopy` calls to the tests of the offending models to verify that the copy is made and the parent is set.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
2017-04-03 10:56:54 +02:00
Jacek Laskowski 0197262a35 [DOCS] Docs-only improvements
…adoc

## What changes were proposed in this pull request?

Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
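
For example, a window spec using the recommended values (column names are hypothetical):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val w = Window
  .partitionBy("id")
  .orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
// e.g. df.withColumn("runningTotal", sum("value").over(w))
```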

## How was this patch tested?

Local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #17417 from jaceklaskowski/window-expression-scaladoc.
2017-03-30 16:07:27 +01:00
Bago Amirbekian a5c87707ea [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest
## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest

## How was this patch tested?

unit tests
doctests

Author: Bago Amirbekian <bago@databricks.com>

Closes #17421 from MrBago/chiSquareTestWrapper.
2017-03-28 19:19:16 -07:00
颜发才(Yan Facai) 7d432af8f3 [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
Fix bug: DecisionTreeModel can't recognize the impurity "Gini" when loading

TODO:
+ [x] add unit test
+ [x] fix the bug

Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
2017-03-28 16:14:01 -07:00
sethah be85245a98
[SPARK-17137][ML][WIP] Compress logistic regression coefficients
## What changes were proposed in this pull request?

Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense - whichever requires less memory.

Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.

## How was this patch tested?

Unit tests added.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #17426 from sethah/SPARK-17137.
2017-03-25 17:41:59 +00:00
Nick Pentreath d9f4ce6943 [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark
Add Python wrapper for `Imputer` feature transformer.

## How was this patch tested?

New doc tests and a tweak to PySpark ML `tests.py`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
2017-03-24 08:01:15 -07:00
Timothy Hunter d27daa54bd [SPARK-19636][ML] Feature parity for correlation statistics in MLlib
## What changes were proposed in this pull request?

This patch adds the DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.

The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
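
A minimal usage sketch of the DataFrame-based API (assumes a DataFrame `df` with a Vector column named "features"):

```scala
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val Row(pearson: Matrix) = Correlation.corr(df, "features").head
val Row(spearman: Matrix) = Correlation.corr(df, "features", "spearman").head
```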

## How was this patch tested?

```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #17108 from thunterdb/19636.
2017-03-23 18:42:13 -07:00
hyukjinkwon aefe798905 [MINOR][BUILD] Fix javadoc8 break
## What changes were proposed in this pull request?

Several javadoc8 breaks have been introduced. This PR proposes to fix those instances so that we can build the Scala/Java API docs.

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
[error]  * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
[error]                                                             ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
[error]  * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
[error]                                                                                            ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
[error]  *    {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
[error]              ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
[error]  *    {link GroupStateTimeout.EventTimeTimeout}).
[error]              ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
[error]  *           Spark SQL types (see {link Encoder} for more details).
[error]                                          ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
[error]    * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
[error]                            ^
[error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
[error]  * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
[error]           ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
[error]           ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
[error]                                     ^
[info] 13 errors
```

```
jekyll 3.3.1 | Error:  Unidoc generation failed
```

## How was this patch tested?

Manually via `jekyll build`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
2017-03-23 08:41:30 +00:00
Joseph K. Bradley ae4b91d1f5 [SPARK-20039][ML] rename ChiSquare to ChiSquareTest
## What changes were proposed in this pull request?

I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.
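
A usage sketch with the new name (assumes a DataFrame `df` with a Vector "features" column and a numeric "label" column):

```scala
import org.apache.spark.ml.stat.ChiSquareTest

val result = ChiSquareTest.test(df, "features", "label").head
// The result row holds pValues, degreesOfFreedom and statistics, one entry per feature.
```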

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17368 from jkbradley/SPARK-20039.
2017-03-21 11:01:25 -07:00
Zheng RuiFeng 63f077fbe5 [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.

## How was this patch tested?
existing tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17369 from zhengruifeng/doc_quantiles_nan.
2017-03-21 08:45:59 -07:00
christopher snow 7620aed828 [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter
## What changes were proposed in this pull request?

API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.

 - [DOCS] was previously: "rank is the number of latent factors in the model."
 - [API] was previously:  "rank - number of features to use"

This change describes rank in both places consistently as:

 - "Number of features to use (also referred to as the number of latent factors)"

Author: Chris Snow <chris.snowuk.ibm.com>

Author: christopher snow <chsnow123@gmail.com>

Closes #17345 from snowch/SPARK-20011.
2017-03-21 13:23:59 +00:00
zero323 bec6b16c19 [SPARK-19899][ML] Replace featuresCol with itemsCol in ml.fpm.FPGrowth
## What changes were proposed in this pull request?

Replaces `featuresCol` `Param` with `itemsCol`. See [SPARK-19899](https://issues.apache.org/jira/browse/SPARK-19899).
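
A usage sketch after the rename (the DataFrame and parameter values are hypothetical):

```scala
import org.apache.spark.ml.fpm.FPGrowth

val fpgrowth = new FPGrowth()
  .setItemsCol("items")       // previously featuresCol
  .setMinSupport(0.3)
  .setMinConfidence(0.8)
val model = fpgrowth.fit(df)  // df has an array column named "items"
```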

## How was this patch tested?

Manual tests. Existing unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17321 from zero323/SPARK-19899.
2017-03-20 10:58:30 -07:00
Joseph K. Bradley 4c3200546c [SPARK-19635][ML] DataFrame-based API for chi square test
## What changes were proposed in this pull request?

A wrapper taking and returning a DataFrame.

## How was this patch tested?

Copied unit tests from RDD-based API

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17110 from jkbradley/df-hypotests.
2017-03-16 17:10:15 -07:00
Yuhao Yang d647aae278 [SPARK-13568][ML] Create feature transformer to impute missing values
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13568
It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median, and most frequent, but we could add various other approaches; where possible, existing DataFrame code can be used (e.g. for approximate quantiles, etc.).

Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
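
A minimal usage sketch (column names are hypothetical; assumes a DataFrame `df` with numeric columns "a" and "b"):

```scala
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_imputed", "b_imputed"))
  .setStrategy("mean")        // or "median"
val model = imputer.fit(df)
model.transform(df).show()
```
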
## How was this patch tested?

new unit tests and manual test

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>

Closes #11601 from hhbyyh/imputer.
2017-03-16 12:49:59 +02:00
Menglong TAN 85941ecf28 [SPARK-11569][ML] Fix StringIndexer to handle null value properly
## What changes were proposed in this pull request?

This PR enhances StringIndexer with NULL value handling.

Before this PR, StringIndexer throws an exception when it encounters NULL values.
With this PR (see the usage sketch after the list below):
- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as unseen labels
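
A usage sketch of the new option (hypothetical DataFrame `df` with a string column "category" containing nulls):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")   // "error" (default) and "skip" remain available
indexer.fit(df).transform(df).show()
```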

BTW, I noticed someone was trying to solve the same problem ( #9920 ) but there seems to have been no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)

## How was this patch tested?

new unit tests

Author: Menglong TAN <tanmenglong@renrenche.com>
Author: Menglong TAN <tanmenglong@gmail.com>

Closes #17233 from crackcell/11569_StringIndexer_NULL.
2017-03-14 07:45:42 -07:00
zero323 d4a637cd46 [SPARK-19940][ML][MINOR] FPGrowthModel.transform should skip duplicated items
## What changes were proposed in this pull request?

This commit moves `distinct` to its intended place to avoid duplicated predictions and adds a unit test covering the issue.

## How was this patch tested?

Unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17283 from zero323/SPARK-19940.
2017-03-14 07:34:44 -07:00
Asher Krim 5e96a57b2f [SPARK-19922][ML] small speedups to findSynonyms
Currently, generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17%.

I wasn't sure if such small changes were worthy of a JIRA, but the guidelines seemed to suggest that that is the preferred approach.

## What changes were proposed in this pull request?

Address a few small issues in the findSynonyms logic:
1) remove usage of ``Array.fill`` to zero out the ``cosineVec`` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed
2) use Floats throughout. The conversion to Doubles before doing the ``priorityQueue`` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC
3) convert the slow ``for(i <- cosVec.indices)`` to an ugly, but faster, ``while`` loop

These efficiencies are really only apparent when working with a large model.
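
The pattern from item 3, as a self-contained sketch (not the actual Word2Vec code):

```scala
val cosVec = Array.ofDim[Float](1000)
val norms  = Array.fill(1000)(2.0f)
// Slower: for (i <- cosVec.indices) cosVec(i) /= norms(i)
// Faster while loop doing the same per-element work:
var i = 0
while (i < cosVec.length) {
  cosVec(i) /= norms(i)
  i += 1
}
```
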
## How was this patch tested?

Existing unit tests + some in-house tests to time the difference

cc jkbradley MLNick srowen

Author: Asher Krim <krim.asher@gmail.com>
Author: Asher Krim <krim.asher@gmail>

Closes #17263 from Krimit/fasterFindSynonyms.
2017-03-14 13:08:11 +00:00
actuaryzhang f6314eab4b [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR
## What changes were proposed in this pull request?
Port Tweedie GLM  #16344  to SparkR

felixcheung yanboliang

## How was this patch tested?
new test in SparkR

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16729 from actuaryzhang/sparkRTweedie.
2017-03-14 00:50:38 -07:00
Joseph K. Bradley 72c66dbbb4 [MINOR][ML] Improve MLWriter overwrite error message
## What changes were proposed in this pull request?

Give proper syntax for Java and Python in addition to Scala.

## How was this patch tested?

Manually.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17215 from jkbradley/write-err-msg.
2017-03-13 16:30:15 -07:00
Xin Ren 9f8ce4825e [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models
## What changes were proposed in this pull request?

RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.

Below 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`

## How was this patch tested?

Tested manually on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #17207 from keypointt/SPARK-19282.
2017-03-12 12:15:19 -07:00