Commit graph

27 commits

Author SHA1 Message Date
sethah 78d740a08a [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet
## What changes were proposed in this pull request?

1. Make a pluggable solver interface for `WeightedLeastSquares`
2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares`
3. Add method `BLAS.dspmv` used by QN solver
4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails.

## How was this patch tested?
Unit tests - see below.

## Design choices

**Pluggable Normal Solver**

Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait:

````scala
private[ml] sealed trait NormalEquationSolver {

  def solve(
      bBar: Double,
      bbBar: Double,
      abBar: DenseVector,
      aaBar: DenseVector,
      aBar: DenseVector): NormalEquationSolution
}
````

We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface.

**Always train in the standardized space**

The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels.

**Use L-BFGS locally to solve normal equations for singular matrix**

When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well.

## Test cases
I found it helpful to enumerate some of the test cases and hopefully it makes review easier.

**WeightedLeastSquares**

1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero.
4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package.
5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet.
6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean.
7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R.

**LinearRegression**
1. For all existing L1 test cases, test the "normal" solver too.
2. Check that using the normal solver now handles singular matrices.
3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA.

**BLAS**
1. Test new method `dspmv`.

## Performance Testing
This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement.

We should decide if this PR needs to be blocked before performance testing is conducted.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15394 from sethah/SPARK-17748.
2016-10-24 23:47:59 -07:00
Zheng RuiFeng a8ea4da8d0
[SPARK-17331][FOLLOWUP][ML][CORE] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

`Array[T]()` -> `Array.empty[T]` to avoid allocating 0-length arrays.
Use regex `find . -name '*.scala' | xargs -i bash -c 'egrep "Array\[[A-Za-z]+\]\(\)" -n {} && echo {}'` to find modification candidates.

cc srowen

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15564 from zhengruifeng/avoid_0_length_array.
2016-10-21 09:49:37 +01:00
Bjarne Fruergaard 29396e7d14 [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector
## What changes were proposed in this pull request?

* changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
* adds a test that was failing before this change, but succeeds with these changes.

The problem in the previous implementation was that it only increments `i`, that is enumerating the columns of a row in the SparseMatrix, when the row-index of the vector matches the column-index of the SparseMatrix. In cases where a particular row of the SparseMatrix has non-zero values at column-indices lower than corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i` and the remaining column-indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.

## How was this patch tested?

I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.

## ___
As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.

Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.

Author: Bjarne Fruergaard <bwahlgreen@gmail.com>

Closes #15296 from bwahlgreen/bugfix-spark-17721.
2016-09-29 15:39:57 -07:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Yanbo Liang 1b001b5203 [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix.
## What changes were proposed in this pull request?
Since we have updated breeze version to 0.12, we should remove work around for bug of breeze sparse matrix in v0.11.
I checked all mllib code and found this is the only work around for breeze 0.11.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14953 from yanboliang/matrices.
2016-09-04 05:38:47 -07:00
Sean Owen 3893e8c576 [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14895 from srowen/SPARK-17331.
2016-09-01 12:13:07 -07:00
Peng, Meng 40168dbe77 [ML][MLLIB] The require condition and message doesn't match in SparseMatrix.
## What changes were proposed in this pull request?
The require condition and message doesn't match, and the condition also should be optimized.
Small change.  Please kindly let me know if JIRA required.

## How was this patch tested?
No additional test required.

Author: Peng, Meng <peng.meng@intel.com>

Closes #14824 from mpjlu/smallChangeForMatrixRequire.
2016-08-27 08:46:01 +01:00
Peng, Meng c0949dc944 [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils
## What changes were proposed in this pull request?

fix comparing Vector bug in TestingUtils.
There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng, Meng <peng.meng@intel.com>

Closes #14785 from mpjlu/testUtils.
2016-08-26 11:54:10 -07:00
Jeff Zhang 072acf5e14 [SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector.
## What changes were proposed in this pull request?

1. In scala, add negative low bound checking and put all the low/upper bound checking in one place
2. In python, add low/upper bound checking of indices.

## How was this patch tested?

unit test added

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14555 from zjffdu/SPARK-16965.
2016-08-19 12:38:15 +01:00
Xin Ren 21a6dd2aef [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
https://issues.apache.org/jira/browse/SPARK-16535

## What changes were proposed in this pull request?

When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).

ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

## How was this patch tested?

I've tested by re-building the project, and build succeeded.

Author: Xin Ren <iamshrek@126.com>

Closes #14189 from keypointt/SPARK-16535.
2016-07-19 11:59:46 +01:00
Sean Owen 5ec0d692b0 [SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.

## How was this patch tested?

Jenkins build of docs

Author: Sean Owen <sowen@cloudera.com>

Closes #14221 from srowen/SPARK-3359.3.
2016-07-16 13:26:58 -07:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
DB Tsai 21b2605dc4 [SPARK-15413][ML][MLLIB] Change toBreeze to asBreeze in Vector and Matrix
## What changes were proposed in this pull request?

We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13198 from dbtsai/minor.
2016-05-27 14:02:39 -07:00
DB Tsai 5255e55c84 [SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala
## What changes were proposed in this pull request?

Add since to ml.stat.MultivariateOnlineSummarizer.scala

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13197 from dbtsai/cleanup.
2016-05-19 13:10:51 -07:00
Pravin Gadakh 31f63ac25d [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR add `Since` annotations in `Vectors.scala` and `Matrices.scala` of spark-mllib-local.

## How was this patch tested?

Scala Style Checks.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #13191 from pravingadakh/SPARK-14613.
2016-05-19 10:59:07 -07:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Xiangrui Meng 0847fe4eb3 [SPARK-14653][ML] Remove json4s from mllib-local
## What changes were proposed in this pull request?

This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.

## How was this patch tested?

Copied existing unit tests over.

cc; dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12802 from mengxr/SPARK-14653.
2016-04-30 06:30:39 -07:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
DB Tsai 96534aa47c [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local
## What changes were proposed in this pull request?

This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /*  since 1.2.0 */

The BLAS implementation will be copied, and some of the test utilities will be copies as well.

Summary of changes:

1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
  - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR.
  - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - `Since` was removed.
  - `UDT` related code was removed.
  - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
  - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12317 from dbtsai/dbtsai-ml-vector.
2016-04-15 01:17:03 -07:00
Sean Owen 9fa43a33b9 [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place
## What changes were proposed in this pull request?

Move json4s, breeze dependency declaration into parent

## How was this patch tested?

Should be no functional change, but Jenkins tests will test that.

Author: Sean Owen <sowen@cloudera.com>

Closes #12390 from srowen/SPARK-14612.
2016-04-14 10:48:17 -07:00
DB Tsai efaf7d1820 [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies.

The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.

Thanks.

## How was this patch tested?

Unit tests

mengxr tedyu holdenk

Author: DB Tsai <dbt@netflix.com>

Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
2016-04-11 09:35:47 -07:00
Xiangrui Meng 415446cc9b Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom"
This reverts commit 1598d11bb0.
2016-04-09 14:03:03 -07:00
DB Tsai 1598d11bb0 [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12241 from dbtsai/dbtsai-mllib-local-build.
2016-04-09 09:21:12 -07:00