Commit graph

1762 commits

Author SHA1 Message Date
Zheng RuiFeng ef3c73535f [SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel
## What changes were proposed in this pull request?
Add missing 'setTopicDistributionCol' for LDAModel
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17021 from zhengruifeng/lda_outputCol.
2017-02-22 16:33:14 +02:00
Sean Owen 1487c9af20
[SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features
## What changes were proposed in this pull request?

Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16964 from srowen/SPARK-19534.
2017-02-19 09:42:50 -08:00
Moussa Taifi 21c7d3c31a
[MLLIB][TYPO] Replace LeastSquaresAggregator with LogisticAggregator
## What changes were proposed in this pull request?

Replace LeastSquaresAggregator with LogisticAggregator in the require statement of the merge op.

## How was this patch tested?

Simple message fix.

Author: Moussa Taifi <moutai10@gmail.com>

Closes #16903 from moutai/master.
2017-02-18 14:10:21 +00:00
Yun Ni 08c1972a06 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang's PR #15768, with conflicts and subsequent Scala API changes resolved. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <yunn@uber.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Yunni <Euler57721@gmail.com>

Closes #16715 from Yunni/spark-18080.
2017-02-15 16:26:05 -08:00
wm624@hotmail.com 3973403d5d [SPARK-19456][SPARKR] Add LinearSVC R API
## What changes were proposed in this pull request?

The linear SVM classifier was newly added to ML and a Python API has been added. This JIRA is to add the R-side API.

Marked as WIP, as I am designing unit tests.

## How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16800 from wangmiao1981/svc.
2017-02-15 01:15:50 -08:00
sureshthalamati f48c5a57d6 [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
## What changes were proposed in this pull request?
The test failed because the property “oracle.jdbc.mapDateToTimestamp” set by the test was being converted to all lower case. The Oracle database expects this property name to be case-sensitive.

This test passed in previous releases because connection properties were sent as the user specified them in the test case scenario. The fixes that made all options be handled uniformly in a case-insensitive manner also converted the JDBC connection properties to lower case.

This PR enhances CaseInsensitiveMap to keep track of the original case-sensitive input keys, and uses those when creating the connection properties that are passed to the JDBC connection.
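As a rough illustration of the idea (a plain-Python sketch, not the Scala code in this PR; the class name `CaseInsensitiveOptions` and its attributes are hypothetical), a map can answer lookups case-insensitively while still remembering the keys exactly as the user typed them:

```python
class CaseInsensitiveOptions:
    """Look up keys case-insensitively, but remember the caller's spelling."""

    def __init__(self, options):
        # Keep the map exactly as the user supplied it.
        self.original = dict(options)
        # Lower-cased view used only for case-insensitive lookups.
        self._lowered = {k.lower(): v for k, v in options.items()}

    def get(self, key, default=None):
        return self._lowered.get(key.lower(), default)


opts = CaseInsensitiveOptions({"oracle.jdbc.mapDateToTimestamp": "false"})
# Lookups ignore case ...
looked_up = opts.get("ORACLE.JDBC.MAPDATETOTIMESTAMP")
# ... while the properties handed to the driver keep the user's casing.
connection_properties = opts.original
```

The point of keeping both views is that option parsing stays case-insensitive while anything forwarded to an external system sees the original keys.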

The alternative approach in PR https://github.com/apache/spark/pull/16847 is to pass the original input keys to the JDBC data source by adding a check in the data source class and handling case-insensitivity in the JDBC source code.

## How was this patch tested?
Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran the docker integration tests on my laptop; all tests passed.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
2017-02-14 15:34:12 -08:00
sueann 3a43ae7c0b [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private
## What changes were proposed in this pull request?
spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.

## How was this patch tested?
```
build/sbt doc  # "mllib.clustering" no longer appears in the docs for *LDA* classes
build/sbt compile  # compiles
build/sbt
> mllib/testOnly   # tests pass
```

Author: sueann <sueann@databricks.com>

Closes #16860 from sueann/SPARK-18613.
2017-02-10 11:50:23 -08:00
actuaryzhang 1aeb9f6cba [SPARK-19400][ML] Allow GLM to handle intercept only model
## What changes were proposed in this pull request?
Intercept-only GLM fails for non-Gaussian families because IWLS reduces an empty array. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) }` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue.
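The failure mode can be reproduced outside Spark. The following is a plain-Python sketch, not the change made in this PR; the helper name `max_abs` and the choice of `0.0` as the seed are assumptions for illustration. A reduce without an initial value has nothing to return for an empty input, while a fold seeded with a neutral value is well defined:

```python
from functools import reduce


def max_abs(coefficients):
    """Largest absolute value, defined as 0.0 for an empty input."""
    # Seeding the fold with 0.0 makes the empty case well defined.
    return reduce(lambda acc, x: max(acc, abs(x)), coefficients, 0.0)


# An unseeded reduce over an empty sequence raises instead of returning.
try:
    reduce(lambda x, y: max(abs(x), abs(y)), [])
    unseeded_failed = False
except TypeError:
    unseeded_failed = True

empty_result = max_abs([])
normal_result = max_abs([-3.0, 2.0])
```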

yanboliang srowen imatiach-msft zhengruifeng

## How was this patch tested?
New test for intercept only model.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16740 from actuaryzhang/interceptOnly.
2017-02-08 10:42:20 -08:00
gatorsmile e33aaa2ac5 [SPARK-19397][SQL] Make option names of LIBSVM and TEXT case insensitive
### What changes were proposed in this pull request?
Prior to Spark 2.1, the option names were case sensitive for all formats. Since Spark 2.1, the option key names have been case insensitive, except for the formats `Text` and `LibSVM`. This PR fixes these issues.

Also, add a check to know whether the input option vector type is legal for `LibSVM`.

### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16737 from gatorsmile/libSVMTextOptions.
2017-02-08 09:33:18 +08:00
gatorsmile 65b10ffb38 [SPARK-19279][SQL] Infer Schema for Hive Serde Tables and Block Creating a Hive Table With an Empty Schema
### What changes were proposed in this pull request?
So far, we allow users to create a table with an empty schema: `CREATE TABLE tab1`. This could break many code paths if we enable it. Thus, we should follow Hive to block it.

For Hive serde tables, some serde libraries require the specified schema and record it in the metastore. To get the list, we need to check `hive.serdes.using.metastore.for.schema`, which contains a list of serdes that require a user-specified schema. The default values are

- org.apache.hadoop.hive.ql.io.orc.OrcSerde
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
- org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
- org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
- org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
- org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
- org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

### How was this patch tested?
Added test cases for both Hive and data source tables

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16636 from gatorsmile/fixEmptyTableSchema.
2017-02-06 13:30:07 +08:00
Asher Krim b3e89802ae [SPARK-19247][ML] Save large word2vec models
## What changes were proposed in this pull request?

* save word2vec models as distributed files rather than as one large datum. Backwards compatibility with the previous save format is maintained by checking for the "wordIndex" column
* migrate the fix for loading large models (SPARK-11994) to ml word2vec

## How was this patch tested?

Tested loading the new and old formats locally

srowen yanboliang MLnick

Author: Asher Krim <akrim@hubspot.com>

Closes #16607 from Krimit/saveLargeModels.
2017-02-05 16:14:07 -08:00
Joseph K. Bradley 1d5d2a9d09 [SPARK-19389][ML][PYTHON][DOC] Minor doc fixes for ML Python Params and LinearSVC
## What changes were proposed in this pull request?

* Removed Since tags in Python Params since they are inherited by other classes
* Fixed doc links for LinearSVC

## How was this patch tested?

* doc tests
* generating docs locally and checking manually

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #16723 from jkbradley/pyparam-fix-doc.
2017-02-02 11:58:46 -08:00
hyukjinkwon f1a1f2607d
[SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
## What changes were proposed in this pull request?

This PR proposes three things as below:

- Support LaTex inline-formula, `\( ... \)` in Scala API documentation
  It seems currently,

  ```
  \( ... \)
  ```

  are rendered as they are, for example,

  <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">

  It seems mistakenly more backslashes were added.

- Fix warnings Scaladoc/Javadoc generation
  This PR fixes two types of warnings as below:

  ```
  [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
  [warn]   /**
  [warn]   ^
  ```

  ```
  [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
  [warn]  * `${var}`, `${system:var}` and `${env:var}`.
  [warn]      ^
  ```

- Fix Javadoc8 break
  ```
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
  [error]    *                          E.g., {link VectorUDT} for vector features.
  [error]                                            ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
  [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
  [error]                                                  ^
  ```

## How was this patch tested?

Manually via `sbt unidoc` and `jekyll build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16741 from HyukjinKwon/warn-and-break.
2017-02-01 13:26:16 +00:00
wm624@hotmail.com 9ac05225e8 [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
## What changes were proposed in this pull request?

When KMeans uses initMode = "random" with certain random seeds, the actual number of clusters may not equal the configured `k`.

In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal k.

Example:
>  col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   cols <- as.data.frame(cbind(col1, col2, col3))
>   df <- createDataFrame(cols)
>
>   model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,  initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
## How was this patch tested?

Add unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16666 from wangmiao1981/kmeans.
2017-01-31 21:16:37 -08:00
Bryan Cutler 57d70d26c8 [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
## What changes were proposed in this pull request?

Add a convenience function to the Python `JavaWrapper` that makes it easy to create a Py4J JavaArray compatible with existing class constructors that take a Scala `Array` as input, so that a separate Java/Python-friendly constructor is no longer necessary.  The function takes as input the Java class that Py4J uses to create the Java array.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
2017-01-31 15:42:36 -08:00
Zheng RuiFeng 42ad93b2c9
[SPARK-19384][ML] forget unpersist input dataset in IsotonicRegression
## What changes were proposed in this pull request?
unpersist the input dataset if `handlePersistence` = true

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16718 from zhengruifeng/isoReg_unpersisit.
2017-01-28 10:18:47 +00:00
wm624@hotmail.com bb1a1fe05e [SPARK-19336][ML][PYSPARK] LinearSVC Python API
## What changes were proposed in this pull request?

Add Python API for the newly added LinearSVC algorithm.

## How was this patch tested?

Add new doc string test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16694 from wangmiao1981/ser.
2017-01-27 16:03:53 -08:00
actuaryzhang 4172ff80dd [SPARK-18929][ML] Add Tweedie distribution in GLM
## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson and Gamma families are special cases of the Tweedie family (https://en.wikipedia.org/wiki/Tweedie_distribution).
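For reference, the power variance function has the form V(mu) = mu^p, and the three families mentioned above correspond to p = 0 (Gaussian), p = 1 (Poisson) and p = 2 (Gamma). A minimal sketch, where the function and parameter names are illustrative and not taken from the patch:

```python
def tweedie_variance(mu, variance_power):
    """Power variance function V(mu) = mu ** p of the Tweedie family."""
    return mu ** variance_power


mu = 3.0
gaussian = tweedie_variance(mu, 0)  # constant variance
poisson = tweedie_variance(mu, 1)   # variance equal to the mean
gamma = tweedie_variance(mu, 2)     # variance equal to the squared mean
```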

yanboliang srowen sethah

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #16344 from actuaryzhang/tweedie.
2017-01-26 23:01:13 -08:00
wm624@hotmail.com c0ba284300 [SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkR
## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16566 from wangmiao1981/bk.
2017-01-26 21:01:59 -08:00
WeichenXu 1191fe267d [SPARK-18218][ML][MLLIB] Reduce shuffled data size of BlockMatrix multiplication and solve potential OOM and low parallelism usage problem By split middle dimension in matrix multiplication
## What changes were proposed in this pull request?

### The problem in current block matrix multiplication

As described in JIRA https://issues.apache.org/jira/browse/SPARK-18218, block matrix multiplication in Spark may cause problems. Suppose we multiply a matrix A with `M*N` dimensions by a matrix B with `N*P` dimensions. When N is much larger than M and P, the following problems may occur:
- when the middle dimension N is too large, it will cause reducer OOM.
- even if OOM does not occur, parallelism will still be too low.
- when N is much larger than M and P, and matrices A and B have many partitions, there may be too many partitions on the M and P dimensions, which causes a much larger shuffled data size. (I will explain this in detail below.)

### Key point of my improvement

In this PR, I introduce a `numMidDimSplits` parameter and improve the algorithm to resolve these problems.

To understand the improvement in this PR, let me first give a simple case to explain how the current multiplication works and what causes the problems above:

Suppose we have a block matrix A containing 200 blocks (`2 numRowBlocks * 100 numColBlocks`), arranged in 2 rows and 100 columns:
```
A00 A01 A02 ... A0,99
A10 A11 A12 ... A1,99
```
and a block matrix B, also containing 200 blocks (`100 numRowBlocks * 2 numColBlocks`), arranged in 100 rows and 2 columns:
```
B00    B01
B10    B11
B20    B21
...
B99,0  B99,1
```
Suppose all blocks in the two matrices are dense for now.
Now we call A.multiply(B). Suppose the generated `resultPartitioner` contains 2 rowPartitions and 2 colPartitions (there can't be more partitions because the result matrix only contains `2 * 2` blocks). The current algorithm then contains two shuffle steps:

**step-1**
Step-1 will generate 4 reducers, which I tag as reducer-00, reducer-01, reducer-10 and reducer-11, and shuffle data as follows:
```
A00 A01 A02 ... A0,99
B00 B10 B20 ... B99,0    shuffled into reducer-00

A00 A01 A02 ... A0,99
B01 B11 B21 ... B99,1    shuffled into reducer-01

A10 A11 A12 ... A1,99
B00 B10 B20 ... B99,0    shuffled into reducer-10

A10 A11 A12 ... A1,99
B01 B11 B21 ... B99,1    shuffled into reducer-11
```

The shuffling above is a `cogroup` transform. Note that each reducer contains **only one group**.

**step-2**
Step-2 will do an `aggregateByKey` transform on the result of step-1. It will also generate 4 reducers and produce the final result RDD, which contains 4 partitions with one block each.

The main problems are in step-1. We have only 4 reducers, but matrices A and B have 400 blocks in total, so the number of reducers is clearly too small.
Moreover, each reducer contains only one group (the group concept in the `coGroup` transform), and each group contains 200 blocks. This is terrible because the `coGroup` transform loads each group into memory when computing, so the algorithm itself does not scale. Suppose matrix A has 10000 column blocks or more instead of 100: then each reducer will load 20000 blocks into memory, which will easily cause reducer OOM.

This PR tries to resolve the problem described above.
When matrix A with dimension M * N is multiplied by matrix B with dimension N * P, the middle dimension N is the key point. If N is large, the current multiplication implementation works badly.
In this PR, I introduce a `numMidDimSplits` parameter, which represents how many splits to cut along the middle dimension N.
Still using the example described above, we now set `numMidDimSplits = 10`, so we can generate 40 reducers in **step-1**:

Each reducer-ij above is now split into 10 reducers: reducer-ij0, reducer-ij1, ... reducer-ij9, and each reducer receives 20 blocks.
Now the shuffle works as follows:

**reducer-000 to reducer-009**
```
A0,0 A0,10 A0,20 ... A0,90
B0,0 B10,0 B20,0 ... B90,0    shuffled into reducer-000

A0,1 A0,11 A0,21 ... A0,91
B1,0 B11,0 B21,0 ... B91,0    shuffled into reducer-001

A0,2 A0,12 A0,22 ... A0,92
B2,0 B12,0 B22,0 ... B92,0    shuffled into reducer-002

...

A0,9 A0,19 A0,29 ... A0,99
B9,0 B19,0 B29,0 ... B99,0    shuffled into reducer-009
```

**reducer-010 to reducer-019**
```
A0,0 A0,10 A0,20 ... A0,90
B0,1 B10,1 B20,1 ... B90,1    shuffled into reducer-010

A0,1 A0,11 A0,21 ... A0,91
B1,1 B11,1 B21,1 ... B91,1    shuffled into reducer-011

A0,2 A0,12 A0,22 ... A0,92
B2,1 B12,1 B22,1 ... B92,1    shuffled into reducer-012

...

A0,9 A0,19 A0,29 ... A0,99
B9,1 B19,1 B29,1 ... B99,1    shuffled into reducer-019
```

**reducer-100 to reducer-109** and **reducer-110 to reducer-119** are similar to the above, so I omit them.
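The listings above are consistent with sending block `A[i, k]` and block `B[k, j]` to the reducer tagged `(i, j, k mod numMidDimSplits)`. The following plain-Python sketch reproduces the assignment for this example; it is an illustration under that assumption, not the Spark implementation, and the function name `assign_reducers` is hypothetical. It also assumes one block row per row partition and one block column per column partition, as in the example.

```python
from collections import defaultdict


def assign_reducers(num_row_parts, num_col_parts, num_mid_blocks, num_mid_dim_splits):
    """Map each (row, col, split) reducer tag to the A and B blocks it receives."""
    reducers = defaultdict(lambda: {"A": [], "B": []})
    for i in range(num_row_parts):
        for j in range(num_col_parts):
            for k in range(num_mid_blocks):
                # Blocks sharing the same middle index k must meet in one reducer.
                key = (i, j, k % num_mid_dim_splits)
                reducers[key]["A"].append((i, k))  # block A[i, k]
                reducers[key]["B"].append((k, j))  # block B[k, j]
    return reducers


split_plan = assign_reducers(2, 2, 100, 10)   # numMidDimSplits = 10
unsplit_plan = assign_reducers(2, 2, 100, 1)  # behaves like the current algorithm
```

With `numMidDimSplits = 10` this yields 40 reducers of 20 blocks each, and with `numMidDimSplits = 1` it yields the 4 reducers of 200 blocks each described for the current algorithm.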

### API for this optimized algorithm

I add a new API as follows:
```
  def multiply(
      other: BlockMatrix,
      numMidDimSplits: Int // middle dimension split number, explained above
  ): BlockMatrix
```

### Shuffled data size analysis (compared under the same parallelism)

The optimization has a subtle influence on the total shuffled data size. An appropriate `numMidDimSplits` will significantly reduce the shuffled data size,
but too large a `numMidDimSplits` may instead increase it. For now I don't want to introduce a formula and make things too complex, so I only use a simple case to illustrate it here:

Suppose we have two square matrices X and Y of the same size, both with `16 numRowBlocks * 16 numColBlocks`. X and Y are both dense matrices. Now let me analyze the shuffled data size in the following cases:

**case 1: X and Y both partitioned in 16 rowPartitions and 16 colPartitions, numMidDimSplits = 1**
ShufflingDataSize = (16 * 16 * (16 + 16) + 16 * 16) blocks = 8448 blocks
parallelism = 16 * 16 * 1 = 256 // use the number of step-1 reducers as the parallelism because step-1 costs most of the computation time in this algorithm.

**case 2: X and Y both partitioned in 8 rowPartitions and 8 colPartitions, numMidDimSplits = 4**
ShufflingDataSize = (8 * 8 * (32 + 32) + 16 * 16 * 4) blocks = 5120 blocks
parallelism = 8 * 8 * 4 = 256 // use the number of step-1 reducers as the parallelism because step-1 costs most of the computation time in this algorithm.

**Both cases above have parallelism = 256**. Case 1 (`numMidDimSplits = 1`) is equivalent to the current implementation in mllib, but the shuffled data in case 2 is 60.6% of that in case 1. **This shows that under the same parallelism, a proper `numMidDimSplits` will significantly reduce the shuffled data size**.
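The arithmetic of the two cases can be checked with a small script. The general formula below is my own generalization of the two worked cases to dense `n x n` block matrices, so treat it as an assumption rather than a formula from the patch; the function names are hypothetical.

```python
def shuffled_blocks(n, row_parts, col_parts, num_mid_dim_splits):
    """Total blocks shuffled when multiplying two dense n x n block matrices."""
    # Step 1: each (row partition, col partition) pair receives one row
    # partition of X and one column partition of Y, spread over its splits.
    x_blocks = (n // row_parts) * n
    y_blocks = n * (n // col_parts)
    step1 = row_parts * col_parts * (x_blocks + y_blocks)
    # Step 2: every result block gets one partial product per split.
    step2 = n * n * num_mid_dim_splits
    return step1 + step2


def parallelism(row_parts, col_parts, num_mid_dim_splits):
    """Number of step-1 reducers."""
    return row_parts * col_parts * num_mid_dim_splits


case1 = shuffled_blocks(16, 16, 16, 1)
case2 = shuffled_blocks(16, 8, 8, 4)
ratio = case2 / case1
```

Here `case1` and `case2` evaluate to 8448 and 5120 blocks, and `ratio` is about 0.606, matching the figures above.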

## How was this patch tested?

Test suites added.
Running result:
![blockmatrix](https://cloud.githubusercontent.com/assets/19235986/21600989/5e162cc2-d1bf-11e6-868c-0ec29190b605.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15730 from WeichenXu123/optim_block_matrix.
2017-01-26 20:10:17 -08:00
sethah 0e821ec6fa [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?

The following test will fail on current master

````scala
test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
    val df = Seq(
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
      .map(Tuple1.apply).toDF("features")
    val gm = new GaussianMixture()
    intercept[IllegalArgumentException] {
      gm.fit(df)
    }
  }
````

Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users.

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. To avoid numerical overflow, we could set the limit to something like `math.sqrt(Integer.MaxValue).toInt` (about 46k), which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit it to 4096 features. We may want to keep that convention here for consistency.
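To make the overflow concrete, here is a plain-Python sketch that emulates 32-bit signed arithmetic (Python integers do not overflow on their own, so the wrap-around helper `to_int32` is an illustrative stand-in for JVM `Int` behaviour):

```python
import math

INT_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer the way 32-bit two's-complement arithmetic would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


# Largest numFeatures whose square still fits in a signed 32-bit integer.
largest_safe = math.isqrt(INT_MAX)

fits = to_int32(largest_safe * largest_safe)
wrapped = to_int32((largest_safe + 1) * (largest_safe + 1))
```

`largest_safe` is 46340, which is the "about 46k" bound mentioned above; one feature more and the array size wraps to a negative number.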

## How was this patch tested?
Unit tests in ML and MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16661 from sethah/gmm_high_dim.
2017-01-25 07:12:25 -08:00
Ilya Matiach d9783380ff [SPARK-18036][ML][MLLIB] Fixing decision trees handling edge cases
## What changes were proposed in this pull request?

Decision trees/GBT/RF do not handle edge cases such as constant features or empty features.
In the case of constant features we choose any arbitrary split instead of failing with a cryptic error message.
In the case of empty features we fail with a better error message stating:
DecisionTree requires number of features > 0, but was given an empty features vector
Instead of the cryptic error message:
java.lang.UnsupportedOperationException: empty.max

## How was this patch tested?

Unit tests are added in the patch for:
DecisionTreeRegressor
GBTRegressor
Random Forest Regressor

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16377 from imatiach-msft/ilmat/fix-decision-tree.
2017-01-24 10:25:12 -08:00
Souljoy Zhuo cca8680047
delete useless var “j”
the var “j” defined in "var j = 0" is useless for “def compress”

Author: Souljoy Zhuo <zhuoshoujie@126.com>

Closes #16676 from xiaoyesoso/patch-1.
2017-01-24 11:33:17 +00:00
Zheng RuiFeng 49f5b0ae4c [SPARK-17747][ML] WeightCol support non-double numeric datatypes
## What changes were proposed in this pull request?

1, add test for `WeightCol` in `MLTestingUtils.checkNumericTypes`
2, move datatype cast to `Predictor.fit`, and supply algos' `train()` with the cast dataframe
## How was this patch tested?

local tests in spark-shell and unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15314 from zhengruifeng/weightCol_support_int.
2017-01-23 17:24:53 -08:00
Ilya Matiach 5b258b8b07 [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case
[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments

## What changes were proposed in this pull request?

Fix a bug in which BisectingKMeans fails with error:
java.util.NoSuchElementException: key not found: 166
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
        at scala.collection.immutable.List.reduceLeft(List.scala:84)
        at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
        at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)

## How was this patch tested?

The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
2017-01-23 13:34:27 -08:00
z001qdp c8aea7445c [SPARK-17455][MLLIB] Improve PAVA implementation in IsotonicRegression
## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression, which is used under the hood by ml.regression.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.
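For readers unfamiliar with PAVA, the following is a minimal plain-Python sketch of a common stack-based formulation for weighted, non-decreasing fits. It is not the code in this patch and not the scikit-learn or R implementation; it assumes strictly positive weights and input already sorted by feature value.

```python
def pool_adjacent_violators(y, w):
    """Weighted least-squares fit of a non-decreasing sequence to y."""
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            pooled = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([pooled, total, count1 + count2])
    # Expand each pooled block back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fit_mixed = pool_adjacent_violators([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0])
fit_decreasing = pool_adjacent_violators([3.0, 2.0, 1.0], [1.0, 1.0, 1.0])
fit_weighted = pool_adjacent_violators([3.0, 1.0], [3.0, 1.0])
```

Each adjacent pair that violates the ordering is pooled into its weighted mean, so a fully decreasing input collapses to a single constant block.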
## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by

``` scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi}
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), w) => (y, x, w)}
```

Before this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.35 |
| 200 | 3.14 |
| 400 | 116.10 |
| 800 | 2134225.90 |

After this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.25 |
| 200 | 2.53 |
| 400 | 5.86 |
| 800 | 10.55 |

Benchmarking was also performed with randomly-generated y values, with similar results.

Author: z001qdp <Nicholas.Eggert@target.com>

Closes #15018 from neggert/SPARK-17455-isoreg-algo.
2017-01-23 13:20:52 -08:00
Yuhao 4a11d029dc [SPARK-14709][ML] spark.ml API for linear SVM
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14709

Provide API for SVM algorithm for DataFrames. As discussed in jira, the initial implementation uses OWL-QN with Hinge loss function.
The API should mimic existing spark.ml.classification APIs.
Currently only Binary Classification is supported. Multinomial support can be added in this or following release.
## How was this patch tested?

new unit tests and simple manual test

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15211 from hhbyyh/mlsvm.
2017-01-23 12:18:06 -08:00
actuaryzhang f067acefab [SPARK-19155][ML] Make family case insensitive in GLM
## What changes were proposed in this pull request?
This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`
```
model.getFamily == Binomial.name || model.getFamily == Poisson.name
```
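One way to make such a check robust to the user's casing is to normalize the name before comparing. This is a plain-Python sketch of the idea only; the helper name `is_binomial_or_poisson` is hypothetical and the actual change in this PR may differ.

```python
def is_binomial_or_poisson(family):
    """Compare the family name against known names, ignoring case."""
    return family.lower() in {"binomial", "poisson"}


capitalized = is_binomial_or_poisson("Poisson")
upper = is_binomial_or_poisson("BINOMIAL")
other = is_binomial_or_poisson("gaussian")
```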

## How was this patch tested?
Update existing tests for 'Poisson' and 'Binomial'.

yanboliang felixcheung imatiach-msft

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16675 from actuaryzhang/family.
2017-01-23 00:53:44 -08:00
Yanbo Liang 0c589e3713 [SPARK-19291][SPARKR][ML] spark.gaussianMixture supports output log-likelihood.
## What changes were proposed in this pull request?
```spark.gaussianMixture``` now outputs the total log-likelihood for the model, like R's ```mvnormalmixEM```.

## How was this patch tested?
R unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16646 from yanboliang/spark-19291.
2017-01-21 21:26:14 -08:00
Yanbo Liang 3dcad9fab1 [SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should case insensitive
## What changes were proposed in this pull request?
MLlib ```GeneralizedLinearRegression``` ```family``` and ```link``` should be case insensitive. This is consistent with some other MLlib params such as [```featureSubsetStrategy```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415).

## How was this patch tested?
Update corresponding tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16516 from yanboliang/spark-19133.
2017-01-21 21:15:57 -08:00
Zheng RuiFeng 8ccca9170f [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary
## What changes were proposed in this pull request?

add loglikelihood in GMM.summary

## How was this patch tested?

added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #12064 from zhengruifeng/gmm_metric.
2017-01-19 03:46:37 -08:00
Ilya Matiach fe409f31d9 [SPARK-14975][ML] Fixed GBTClassifier to predict probability per training instance and fixed interfaces
## What changes were proposed in this pull request?

We can predict probabilities for all of the classifiers in MLlib except GBTClassifier.
Also, all classifiers inherit from ProbabilisticClassifier but GBTClassifier strangely inherits from Predictor, which is a bug.
This change corrects the interface and adds the ability for the classifier to give a probabilities vector.

## How was this patch tested?

The basic ML tests were run after making the changes.  I've marked this as WIP as I need to add more tests.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16441 from imatiach-msft/ilmat/fix-GBT.
2017-01-18 15:33:41 -08:00
uncleGen eefdf9f9dd
[SPARK-19227][SPARK-19251] remove unused imports and outdated comments
## What changes were proposed in this pull request?
Remove unused imports and outdated comments, and fix some minor code style issues.

## How was this patch tested?
Existing unit tests.

Author: uncleGen <hustyugm@gmail.com>

Closes #16591 from uncleGen/SPARK-19227.
2017-01-18 09:44:32 +00:00
Zheng RuiFeng e7f982b20d [SPARK-18206][ML] Add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR
## What changes were proposed in this pull request?

add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR
## How was this patch tested?

local test in spark-shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #15671 from zhengruifeng/lir_instr.
2017-01-17 15:39:51 -08:00
hyukjinkwon 6c00c069e3
[SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in sbt/sbt unidoc
## What changes were proposed in this pull request?

This PR proposes to fix ambiguous link warnings by simply turning the links into code blocks for both javadoc and scaladoc.

```
[warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
...
```

This PR also fixes the javadoc8 breaks shown below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                   ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                                ^
[info] 3 errors
```

## How was this patch tested?

Manually, via `sbt unidoc > output.txt`, and then checked the result via `cat output.txt | grep ambiguous` and `sbt unidoc | grep error`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16604 from HyukjinKwon/SPARK-3249.
2017-01-17 12:28:15 +00:00
wm624@hotmail.com 12c8c21608 [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly
## What changes were proposed in this pull request?

spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, although it should be TRUE according to the Scala code.

In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

## How was this patch tested?
Manual tests comparing against the Scala example.
Modified the current unit test: fixed the incorrect unit test and added the necessary tests for the `summary` method.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16464 from wangmiao1981/new.
2017-01-16 06:05:59 -08:00
wm624@hotmail.com 7f24a0b6c3 [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters
## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed, and tol. As the Spark KMeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.

Add the missing parameters and the corresponding documentation.

Modified existing unit tests to take additional parameters.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16523 from wangmiao1981/kmeans.
2017-01-12 22:27:57 -08:00
wm624@hotmail.com c983267b08 [SPARK-19110][MLLIB][FOLLOWUP] Add a unit test for testing logPrior and logLikelihood of DistributedLDAModel in MLLIB
## What changes were proposed in this pull request?
#16491 added the fix to mllib and a unit test to ml. This follow-up PR adds unit tests to the mllib suite.

## How was this patch tested?
Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16524 from wangmiao1981/ldabug.
2017-01-12 18:31:57 -08:00
Peng, Meng 32286ba68a
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add an FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for #15212.

## How was this patch tested?
Unit tests.

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
2017-01-10 13:09:58 +00:00
Yanbo Liang 3ef6d98a80 [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy `GaussianMixture` implementation from mllib to ml, then we can add new features to it.
Unlike some other algorithms, I did not turn mllib `GaussianMixture` into a wrapper around the ml implementation and left it untouched, for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml currently does not. (We will definitely add this feature to ml in the future.)

We could work around these issues to make mllib a wrapper that calls into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.

Meanwhile, this PR brings a big performance improvement for `GaussianMixture`. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated job execution.
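
The storage saving can be sketched in plain Python. This is an illustration under assumptions, not the code from this PR: the helpers `pack_upper` and `unpack_upper` and the column-by-column layout are hypothetical choices. A symmetric n x n matrix has only n(n+1)/2 distinct entries, which approaches half of n^2 as n grows.

```python
def pack_upper(matrix):
    """Flatten the upper triangle (including the diagonal), column by column."""
    n = len(matrix)
    return [matrix[i][j] for j in range(n) for i in range(j + 1)]


def unpack_upper(packed, n):
    """Rebuild the full symmetric n x n matrix from its packed upper triangle."""
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for j in range(n):
        for i in range(j + 1):
            full[i][j] = packed[k]
            full[j][i] = packed[k]  # mirror into the lower triangle
            k += 1
    return full


cov = [[4.0, 1.0, 0.5],
       [1.0, 3.0, 0.2],
       [0.5, 0.2, 2.0]]
packed = pack_upper(cov)
print(len(packed), "values instead of", len(cov) ** 2)
print(unpack_upper(packed, 3) == cov)
```

Only the packed list would need to travel between workers in such a scheme; the full matrix is rebuilt where it is needed.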

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
2017-01-09 21:38:46 -08:00
wm624@hotmail.com 036b50347c [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model
## What changes were proposed in this pull request?

While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior of the original model differs from that of the loaded model.
For example, in `test("read/write DistributedLDAModel")`, I added the following check:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns a different `logPrior`.
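
The order dependence can be reproduced with a small plain-Python sketch. It rests on an assumption that is not spelled out above: that the faulty `seqOp` dropped the running total. The `aggregate` helper below only mimics the shape of an RDD-style aggregate (fold each partition, then merge the partial results), and `buggy_seq_op` and `fixed_seq_op` are hypothetical names.

```python
from functools import reduce
from operator import add


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


def buggy_seq_op(acc, value):
    # Assumed bug: the accumulator is ignored, so only the last value survives.
    return value


def fixed_seq_op(acc, value):
    # The accumulator is carried forward, so every value contributes.
    return acc + value


original_order = [[1.0, 2.0, 3.0, 4.0]]   # one partition
loaded_order = [[4.0, 3.0, 2.0, 1.0]]     # same values, different order

print(aggregate(original_order, 0.0, buggy_seq_op, add))
print(aggregate(loaded_order, 0.0, buggy_seq_op, add))
print(aggregate(original_order, 0.0, fixed_seq_op, add))
print(aggregate(loaded_order, 0.0, fixed_seq_op, add))
```

With the accumulator carried forward, the result no longer depends on the order in which elements are visited.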

Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16491 from wangmiao1981/ldabug.
2017-01-07 11:07:49 -08:00
Wenchen Fan b3d39620c5 [SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter
## What changes were proposed in this pull request?

`OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:
1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write` and remove `def write(row: Row)` and its related converter code. All implementations should just implement `def write(row: InternalRow)`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16479 from cloud-fan/hive-writer.
2017-01-08 00:42:09 +08:00
sueann d60f6f62d0 [SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes #16480 from sueann/SPARK-18194.
2017-01-06 18:53:16 -08:00
Yanbo Liang dfc4c935ba [MINOR] Correct LogisticRegression test case for probability2prediction.
## What changes were proposed in this pull request?
Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```.

## How was this patch tested?
Change unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16477 from yanboliang/lor-pred.
2017-01-05 18:59:49 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Zheng RuiFeng 7a82505817
[SPARK-19054][ML] Eliminate extra pass in NB
## What changes were proposed in this pull request?
Eliminate an unnecessary extra pass in NB's train.

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16453 from zhengruifeng/nb_getNC.
2017-01-04 11:54:13 +00:00
Weiqing Yang e5c307c50a
[MINOR] Add missing sc.stop() to end of examples
## What changes were proposed in this pull request?

Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.
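
The pattern can be sketched in plain Python. `FakeContext` and `run_and_stop` are hypothetical stand-ins for illustration and are not part of Spark; the point is only that the `finally` clause stops the context even when the test body fails.

```python
class FakeContext:
    """Hypothetical stand-in for a context object that must be stopped."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_and_stop(ctx, body):
    """Run the test body and always stop the context afterwards."""
    try:
        body(ctx)
    finally:
        ctx.stop()  # runs on success and on failure


def failing_body(ctx):
    raise RuntimeError("simulated test failure")


ctx = FakeContext()
try:
    run_and_stop(ctx, failing_body)
except RuntimeError:
    pass
print(ctx.stopped)
```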

## How was this patch tested?
Pass the build and unit tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16426 from weiqingy/testIssue.
2017-01-03 09:56:42 +00:00
Sean Owen 56d3a7eb83
[SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient
## What changes were proposed in this pull request?

mllib.KMeansModel.clusterCentersWithNorm is a method that ends up being called every time `predict` is called on a single vector, which is bad news for how the ml.KMeansModel Transformer works, since it necessarily transforms one vector at a time.

This change makes the model store the vectors with their norms upfront. The extra norm should be small compared to the vectors. This avoids this form of overhead on this and other code paths.
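
A minimal plain-Python sketch of the caching idea follows. It is not the mllib implementation: the class `CentersWithNorm` is hypothetical, and the distance expansion ||p - c||^2 = ||p||^2 + ||c||^2 - 2 p.c is used here only to show where a precomputed norm pays off.

```python
import math


class CentersWithNorm:
    """Hypothetical model that computes each center's squared norm once."""

    def __init__(self, centers):
        self.centers = centers
        # Computed at construction time, not on every predict call.
        self.squared_norms = [sum(x * x for x in c) for c in centers]

    def predict(self, point):
        """Return the index of the closest center."""
        point_sq = sum(x * x for x in point)
        best_index, best_dist = -1, math.inf
        for i, (center, center_sq) in enumerate(zip(self.centers, self.squared_norms)):
            dot = sum(p * c for p, c in zip(point, center))
            dist = point_sq + center_sq - 2.0 * dot
            if dist < best_dist:
                best_index, best_dist = i, dist
        return best_index


model = CentersWithNorm([[0.0, 0.0], [10.0, 10.0]])
print(model.predict([1.0, 1.0]))
print(model.predict([9.0, 8.0]))
```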

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #16328 from srowen/SPARK-18808.
2016-12-30 10:40:17 +00:00
Ilya Matiach 87bc4112c5 [SPARK-18698][ML] Adding public constructor that takes uid for IndexToString
## What changes were proposed in this pull request?

Based on SPARK-18698, this adds a public constructor that takes a UID for IndexToString.  Other transforms have similar constructors.

## How was this patch tested?

A unit test was added to verify the new functionality.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16436 from imatiach-msft/ilmat/fix-indextostring.
2016-12-29 13:25:49 -08:00
sethah 6a475ae466 [SPARK-17772][ML][TEST] Add test functions for ML sample weights
## What changes were proposed in this pull request?

More and more ML algorithms accept sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, there seem to be a few tests that have been implemented commonly:

* Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
* Check that outliers with tiny sample weights do not affect the algorithm's performance
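
Both properties can be illustrated with a plain-Python sketch that uses a weighted mean as a stand-in estimator. This is an assumption for illustration only; the helpers in `MLTestingUtils` work on Spark estimators, and `weighted_mean` is a hypothetical name.

```python
def weighted_mean(values, weights):
    """Weighted mean, used here as a stand-in for a weighted estimator."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Oversampling: repeating 2.0 three times equals giving it weight 3.
oversampled = weighted_mean([1.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])
weighted = weighted_mean([1.0, 2.0], [1.0, 3.0])
print(oversampled == weighted)

# Outlier: a huge value with a tiny weight barely moves the estimate.
clean = weighted_mean([1.0, 2.0], [1.0, 1.0])
with_outlier = weighted_mean([1.0, 2.0, 1000.0], [1.0, 1.0, 1e-12])
print(abs(clean - with_outlier) < 1e-6)
```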

This patch adds an additional test:

* Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`
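
The scaling invariance can be seen in a plain-Python sketch with a weighted least-squares slope through the origin as a stand-in model. This is an assumption for illustration, and `weighted_slope` is a hypothetical helper rather than part of the patch. A constant factor in the weights cancels between numerator and denominator, so the results agree up to floating-point rounding.

```python
import math


def weighted_slope(xs, ys, weights):
    """Weighted least-squares slope of a line through the origin."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, weights))
    den = sum(w * x * x for x, w in zip(xs, weights))
    return num / den


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
base = weighted_slope(xs, ys, [1.0] * 3)
for scale in (10000.0, 0.0001):
    scaled = weighted_slope(xs, ys, [scale] * 3)
    # Compare with a tolerance, since rounding may differ slightly.
    print(scale, math.isclose(base, scaled, rel_tol=1e-9))
```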

The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.

## How was this patch tested?

This patch only involves modifying test suites.

## Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15721 from sethah/SPARK-17772.
2016-12-28 07:01:14 -08:00