Commit graph

1060 commits

Author SHA1 Message Date
Yin Huai b0baa11d3b [HOT-FIX] Fix style.
https://github.com/apache/spark/pull/8882 broke our build.

Author: Yin Huai <yhuai@databricks.com>

Closes #8964 from yhuai/fixStyle.
2015-10-02 11:23:08 -07:00
Xusen Yin 633aaae0a1 [SPARK-6530] [ML] Add chi-square selector for ml package
See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5742 from yinxusen/SPARK-6530.
2015-10-02 10:25:58 -07:00
Xusen Yin 23a9448c04 [SPARK-5890] [ML] Add feature discretizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).

I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5779 from yinxusen/SPARK-5890.
2015-10-02 10:19:18 -07:00
Rerngvit Yanggratoke 2a717821bb [SPARK-9798] [ML] CrossValidatorModel Documentation Improvements
Document CrossValidatorModel members: bestModel and avgMetrics

Author: Rerngvit Yanggratoke <rerngvit@kth.se>

Closes #8882 from rerngvit/Spark-9798.
2015-10-02 10:15:02 -07:00
Yanbo Liang 2931e89f0c [SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = ""
For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8937 from yanboliang/spark-10736.
2015-09-29 23:58:32 -07:00
y-shimizu 299b439920 [SPARK-10778] [MLLIB] Implement toString for AssociationRules.Rule
I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0`

Author: y-shimizu <y.shimizu0429@gmail.com>

Closes #8904 from y-shimizu/master.
2015-09-27 16:36:03 +01:00
Eric Liang 922338812c [SPARK-9681] [ML] Support R feature interactions in RFormula
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.
2015-09-25 00:43:22 -07:00
Holden Karau d91967e159 [SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use simplified dataframe construction
As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Lets update the tests to use those.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
2015-09-23 22:49:08 -07:00
Yanbo Liang 067afb4e9b [SPARK-10699] [ML] Support checkpointInterval can be disabled
Currently use can set ```checkpointInterval``` to specify how often should the cache be check-pointed. But we also need the function that users can disable it. This PR supports that users can disable checkpoint if user setting ```checkpointInterval = -1```.
We also add documents for GBT ```cacheNodeIds``` to make users can understand more clearly about checkpoint.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8820 from yanboliang/spark-10699.
2015-09-23 16:41:42 -07:00
Yanbo Liang ce2b056d35 [SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression
By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8836 from yanboliang/spark-10686.
2015-09-23 15:26:02 -07:00
sethah 098be27ad5 [SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #8675 from sethah/SPARK-9715.
2015-09-23 15:00:52 -07:00
Yanbo Liang 7104ee0e5d [SPARK-10750] [ML] ML Param validate should print better error information
Currently when you set illegal value for params of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw IllegalArgumentException but with incomprehensible error information.
Take ```VectorSlicer.setNames``` as an example:
```scala
val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
// The value of setNames must be contain distinct elements, so the next line will throw exception.
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
```
It will throw IllegalArgumentException as:
```
vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
```
We should distinguish the value of array type from primitive type at Param.validate(value: T), and we will get better error information.
```
vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8863 from yanboliang/spark-10750.
2015-09-22 11:00:33 -07:00
Holden Karau f4a3c4e34c [SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.
2015-09-22 10:19:08 -07:00
Meihua Wu 870b8a2edd [SPARK-10706] [MLLIB] Add java wrapper for random vector rdd
Add java wrapper for random vector rdd

holdenk srowen

Author: Meihua Wu <meihuawu@umich.edu>

Closes #8841 from rotationsymmetry/SPARK-10706.
2015-09-22 11:05:24 +01:00
Feynman Liang aeef44a3e3 [SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance testing
Implementation of significance testing using Streaming API.

Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>

Closes #4716 from feynmanliang/ab_testing.
2015-09-21 13:11:28 -07:00
Meihua Wu 331f0b10f7 [SPARK-9642] [ML] LinearRegression should supported weighted data
In many modeling application, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting which account the over or under sampling.

work in progress.

Author: Meihua Wu <meihuawu@umich.edu>

Closes #8631 from rotationsymmetry/SPARK-9642.
2015-09-21 12:09:00 -07:00
Holden Karau 20a61dbd9b [SPARK-10626] [MLLIB] create java friendly method for random rdd
SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
2015-09-21 18:53:28 +01:00
lewuathe 0c498717ba [SPARK-10715] [ML] Duplicate initialization flag in WeightedLeastSquare
There are duplicate set of initialization flag in `WeightedLeastSquares#add`.
`initialized` is already set in `init(Int)`.

Author: lewuathe <lewuathe@me.com>

Closes #8837 from Lewuathe/duplicate-initialization-flag.
2015-09-20 16:16:31 -07:00
Sean Owen 1aa9e50256 [SPARK-5905] [MLLIB] Note requirements for certain RowMatrix methods in docs
Note methods that fail for cols > 65535; note that SVD does not require n >= m
CC mengxr

Author: Sean Owen <sowen@cloudera.com>

Closes #8839 from srowen/SPARK-5905.
2015-09-20 16:05:12 -07:00
Eric Liang c8149ef2c5 [MINOR] [ML] override toString of AttributeGroup
This makes equality test failures much more readable.

mengxr

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #8826 from ericl/attrgroupstr.
2015-09-18 16:23:05 -07:00
Yanbo Liang 98f1ea67da [SPARK-8518] [ML] Log-linear models for survival analysis
[Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easy to parallel method of survival analysis for censored survival data. It is the log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the prediction. There are different kinds of model prediction, I have just select the type ```response``` which is default used for R.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8611 from yanboliang/spark-8518.
2015-09-17 21:37:10 -07:00
Eric Liang 4fbf332869 [SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions
This is a pre-req for supporting the ":" operator in the RFormula feature transformer.

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7987 from ericl/interaction.
2015-09-17 14:09:06 -07:00
Yanbo Liang 64743870f2 [SPARK-10394] [ML] Make GBTParams use shared stepSize
```GBTParams``` has ```stepSize``` as learning rate currently.
ML has shared param class ```HasStepSize```, ```GBTParams``` can extend from it rather than duplicated implementation.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8552 from yanboliang/spark-10394.
2015-09-17 11:24:38 -07:00
Holden Karau e51345e1e0 [SPARK-10077] [DOCS] [ML] Add package info for java of ml/feature
Should be the same as SPARK-7808 but use Java for the code example.
It would be great to add package doc for `spark.ml.feature`.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.
2015-09-17 09:17:43 -07:00
DB Tsai be52faa7c7 [SPARK-7685] [ML] Apply weights to different samples in Logistic Regression
In fraud detection dataset, almost all the samples are negative while only couple of them are positive. This type of high imbalanced data will bias the models toward negative resulting poor performance. In python-scikit, they provide a correction allowing users to Over-/undersample the samples of each class according to the given weights. In auto mode, selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into loss and gradient instead of doing actual over/undersampling in the training dataset which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data maybe more important like the training samples from tenure users while the training samples from new users maybe less important. We should be able to provide another "weight: Double" information in the LabeledPoint to weight them differently in the learning algorithm.

Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>

Closes #7884 from dbtsai/SPARK-7685.
2015-09-15 15:46:47 -07:00
Marcelo Vanzin b42059d2ef Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py."
This reverts commit 8abef21dac.
2015-09-15 13:03:38 -07:00
Marcelo Vanzin 8abef21dac [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
This change does two things:

- tag a few tests and adds the mechanism in the build to be able to disable those tags,
  both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
  changed; that's used to disable expensive tests when a module hasn't explicitly
  been changed, to speed up testing for changes that don't directly affect those
  modules.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8437 from vanzin/test-tags.
2015-09-15 10:45:02 -07:00
Yuhao Yang c35fdcb7e9 [SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS
jira: https://issues.apache.org/jira/browse/SPARK-10491

We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.

Let me know if new UT needed.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #8663 from hhbyyh/movedspr.
2015-09-15 09:58:49 -07:00
Reynold Xin 09b7e7c198 Update version to 1.6.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
2015-09-15 00:54:20 -07:00
Nick Pritchard 8a634e9bcc [SPARK-10573] [ML] IndexToString output schema should be StringType
Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.

Author: Nick Pritchard <nicholas.pritchard@falkonry.com>

Closes #8751 from pnpritchard/SPARK-10573.
2015-09-14 13:27:45 -07:00
Yanbo Liang ce6f3f163b [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8457 from yanboliang/spark-10194.
2015-09-14 12:08:52 -07:00
Bertrand Dechoux d81565465c [SPARK-9720] [ML] Identifiable types need UID in toString methods
A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was not present anymore in the toString result. It is the default behaviour.

This patch is a quick fix. The question of enforcement is still up.

No tests have been written to verify the toString method behaviour. That would be long to do because all types should be tested and not only those which have a regression now.

It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira).

Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>

Closes #8062 from BertrandDechoux/SPARK-9720.
2015-09-14 09:18:46 +01:00
Joseph K. Bradley 2e3a280754 [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils
Changes:
* Make Scala doc for StringIndexerInverse clearer.  Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore

CC: holdenk mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8679 from jkbradley/doc-fixes-1.5.
2015-09-11 08:55:35 -07:00
Xiangrui Meng 960d2d0ac6 [SPARK-10537] [ML] document LIBSVM source options in public API doc and some minor improvements
We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR:

1. Do `vectorType == "sparse"` only once.
2. Update `hashCode` and `equals`.
3. Remove inherited doc.
4. Delete temp dir in `afterAll`.

Lewuathe

Author: Xiangrui Meng <meng@databricks.com>

Closes #8699 from mengxr/SPARK-10537.
2015-09-11 08:53:40 -07:00
Yanbo Liang b01b262606 [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier
Add Python API for ```MultilayerPerceptronClassifier```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8067 from yanboliang/SPARK-9773.
2015-09-11 08:52:28 -07:00
Yanbo Liang 339a527141 [SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval between Scala and Python API.
"checkpointInterval" is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them.
```
member of DecisionTreeParams <-> Scala API
shared param for all ML Transformer/Estimator <-> Python API
```
Proposal:
"checkpointInterval" is also used by ALS, so we make it shared params at Scala.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8528 from yanboliang/spark-10023.
2015-09-10 20:34:00 -07:00
lewuathe 2ddeb63126 [SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM data
It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.

Two option is implemented.
* `numFeatures`: Specify the dimension of features vector
* `featuresType`: Specify the type of output vector. `sparse` is default.

Author: lewuathe <lewuathe@me.com>

Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:

986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
2015-09-09 09:29:10 -07:00
Luc Bourlier c1bc4f439f [SPARK-10227] fatal warnings with sbt on Scala 2.11
The bulk of the changes are on `transient` annotation on class parameter. Often the compiler doesn't generate a field for this parameters, so the the transient annotation would be unnecessary.
But if the class parameter are used in methods, then fields are created. So it is safer to keep the annotations.

The remainder are some potential bugs, and deprecated syntax.

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #8433 from skyluc/issue/sbt-2.11.
2015-09-09 09:57:58 +01:00
Holden Karau 2f6fd5256c [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark
Adds IndexToString to PySpark.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
2015-09-08 22:13:05 -07:00
Yanbo Liang a1573489a3 [SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator
Add WeibullGenerator for RandomDataGenerator.
#8611 need use WeibullGenerator to generate random data based on Weibull distribution.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8622 from yanboliang/spark-10464.
2015-09-08 20:54:02 -07:00
Xiangrui Meng 52fe32f6ac [SPARK-9834] [MLLIB] implement weighted least squares via normal equation
The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.

There are couple TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc

It would be nice to have this merged first because it blocks couple other features.

dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #8588 from mengxr/SPARK-9834.
2015-09-08 20:51:20 -07:00
Vinod K C e6f8d36860 [SPARK-10468] [ MLLIB ] Verify schema before Dataframe select API call
Loader.checkSchema was called to verify the schema after dataframe.select(...).
Schema verification should be done before dataframe.select(...)

Author: Vinod K C <vinod.kc@huawei.com>

Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
2015-09-08 14:44:05 -07:00
Yanbo Liang f7b55dbfc3 [SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parent
Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent.
Here fix it and add test case.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8637 from yanboliang/spark-10470.
2015-09-08 12:48:21 -07:00
Yanbo Liang 5b2192e846 [SPARK-10480] [ML] Fix ML.LinearRegressionModel.copy()
This PR fix two model ```copy()``` related issues:
[SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480)
```ML.LinearRegressionModel.copy()``` ignored argument ```extra```, it will not take effect when users setting this parameter.
[SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479)
```ML.LogisticRegressionModel.copy()``` should copy model summary if available.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8641 from yanboliang/linear-regression-copy.
2015-09-08 11:11:35 -07:00
Holden Karau 871764c6ce [SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit tests
From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
2015-09-05 00:04:00 -10:00
Holden Karau 22eab706f4 [SPARK-10402] [DOCS] [ML] Add defaults to the scaladoc for params in ml/
We should make sure the scaladoc for params includes their default values through the models in ml/

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
2015-09-04 17:32:35 -07:00
Holden Karau 44948a2e9d [SPARK-9723] [ML] params getordefault should throw more useful error
Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
2015-09-02 21:19:42 -07:00
Holden Karau e6e483cc4d [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover
Add a python API for the Stop Words Remover.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
2015-09-01 10:48:57 -07:00
Yanbo Liang fe16fd0b8b [SPARK-10349] [ML] OneVsRest use 'when ... otherwise' not UDF to generate new label at binary reduction
Currently OneVsRest use UDF to generate new binary label during training.
Considering that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise``` which will be more efficiency.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8519 from yanboliang/spark-10349.
2015-08-31 16:06:38 -07:00
Xiangrui Meng 23e39cc7b1 [SPARK-9954] [MLLIB] use first 128 nonzeros to compute Vector.hashCode
This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #8182 from mengxr/SPARK-9954.
2015-08-31 15:49:25 -07:00