SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting `standardize`, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9331 from yanboliang/spark-11369.
WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.
Author: Nakul Jindal <njindal@us.ibm.com>
Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
Author: Sean Owen <sowen@cloudera.com>
Closes #9309 from srowen/SPARK-11302.2.
Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.
With a test.
Author: Reza Zadeh <reza@databricks.com>
Closes #8792 from rezazadeh/colsims.
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier
Author: Sean Owen <sowen@cloudera.com>
Closes #9169 from srowen/SPARK-11184.
This is a PR for Parquet-based model import/export.
* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite
Author: Jayant Shekar <jayant@user-MBPMBA-3.local>
Closes #6785 from jayantshekhar/SPARK-6723.
The given row_ind should be less than the number of rows.
The given col_ind should be less than the number of cols.
The current code in master behaves unpredictably in such cases.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8271 from MechCoder/hash_code_matrices.
…2 regularization if the number of features is small
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>
Closes #8884 from Lewuathe/SPARK-10668.
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl
Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>
Closes #8609 from lkhamsurenl/SPARK-9963.
jira: https://issues.apache.org/jira/browse/SPARK-11029
We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel.
This will be a temp fix until we have proper evaluators defined for clustering.
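As a rough illustration of what such a cost method computes, the sketch below sums the squared Euclidean distance from each point to its nearest center. It is plain Python rather than Spark code, and the name `compute_cost` and the tuple-based point representation are assumptions made for this example only.

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    # For every point, keep only the distance to the closest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Two tight clusters; each point lies 0.5 away from its center.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 8.0), (10.0, 8.0)]
centers = [(0.5, 0.0), (9.5, 8.0)]
cost = compute_cost(points, centers)
```

A lower cost indicates tighter clusters, which is why it can stand in for an evaluator until a proper one exists.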
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
Closes #9073 from hhbyyh/computeCost.
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
- Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
- Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
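The two ideas above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: blocks are identified by their `(row, col)` block indices, `plan_shuffle` is a hypothetical name, and the partitioner for result blocks is a made-up `(i + j) % num_partitions`.

```python
def plan_shuffle(a_blocks, b_blocks, num_partitions):
    """Simulate A * B on block indices and return where each block must go.

    a_blocks holds (i, k) indices of the non-empty blocks of A,
    b_blocks holds (k, j) indices of the non-empty blocks of B.
    """
    def partition_of(i, j):
        # Hypothetical partitioner for the result block (i, j).
        return (i + j) % num_partitions

    # An A block (i, k) is only needed where some B block (k, j) exists.
    # Using a set means each block is sent at most once per partition.
    a_dest = {
        (i, k): {partition_of(i, j) for (k2, j) in b_blocks if k2 == k}
        for (i, k) in a_blocks
    }
    b_dest = {
        (k, j): {partition_of(i, j) for (i, k2) in a_blocks if k2 == k}
        for (k, j) in b_blocks
    }
    return a_dest, b_dest


a_dest, b_dest = plan_shuffle({(0, 0), (1, 1)}, {(0, 0), (0, 1)}, num_partitions=2)
```

In this example the A block `(1, 1)` has no matching B block, so its destination set is empty and it would not be shuffled at all.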
**NOTE**: The old behavior of checking for multiple blocks with the same index is currently lost. This is not hard to add back, but is a little more expensive than it was.
Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s on a 120 GB / 16 core cluster.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8757 from brkyvz/opt-bmm.
Values in the quantile probabilities array should be in the range (0, 1) instead of [0, 1]
in `AFTSurvivalRegression.scala`, according to the [Discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
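The difference between the open and closed interval can be shown with a small validator. This is a plain-Python sketch, not the Scala param validator; the function name is an assumption for illustration.

```python
def validate_quantile_probabilities(probs):
    """Accept only probabilities strictly between 0 and 1."""
    # The chained comparison is False for 0.0, 1.0, out-of-range values and NaN.
    bad = [p for p in probs if not (0.0 < p < 1.0)]
    if bad:
        raise ValueError("quantile probabilities must be in (0, 1), got %r" % bad)
    return list(probs)


accepted = validate_quantile_probabilities([0.1, 0.5, 0.9])
```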
Author: vectorijk <jiangkai@gmail.com>
Closes #9083 from vectorijk/spark-11059.
This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
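Standard JSON has no literal for `NaN` or infinities, which is why floating-point params need special handling. The sketch below shows one way to do this in plain Python by encoding the special values as strings. The tokens `"NaN"`, `"Inf"` and `"-Inf"` and all function names are assumptions for this example, not necessarily what the PR uses.

```python
import json
import math


def encode_double(x):
    """Map a float to a JSON-safe value; special values become strings."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Inf" if x > 0 else "-Inf"
    return x


def decode_double(v):
    """Inverse of encode_double."""
    if isinstance(v, str):
        return {"NaN": float("nan"), "Inf": float("inf"), "-Inf": float("-inf")}[v]
    return float(v)


def encode_double_array(xs):
    return json.dumps([encode_double(x) for x in xs])


def decode_double_array(s):
    return [decode_double(v) for v in json.loads(s)]


text = encode_double_array([1.5, float("nan"), float("inf"), float("-inf")])
decoded = decode_double_array(text)
```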
Author: Xiangrui Meng <meng@databricks.com>
Closes #9090 from mengxr/SPARK-7402.
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
Closes #8700 from smartkiwi/SPARK-10535_.
Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
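Because a covariance matrix is symmetric, only the entries on and above the diagonal need to be computed. A minimal plain-Python sketch of that idea follows; it uses the sample covariance with an `n - 1` denominator as an assumption and is not the distributed implementation.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]

    # Compute the upper triangle, including the diagonal.
    for i in range(d):
        for j in range(i, d):
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            cov[i][j] = s / (n - 1)

    # Mirror the upper triangle into the lower triangle.
    for i in range(d):
        for j in range(i):
            cov[i][j] = cov[j][i]
    return cov


cov = covariance([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```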
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #8940 from pnpritchard/SPARK-10875.
GBT compares the validation error against a tolerance that switches between relative and absolute, where the relative tolerance is relative to the current loss on the training set.
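One way to read this rule is sketched below in plain Python: the tolerance is scaled by the current loss while the loss is large, and falls back to a fixed scale once the loss becomes very small. The function name, the parameter names and the `0.01` floor are assumptions for illustration, not values taken from this PR.

```python
def should_stop(best_validation_error, current_validation_error,
                current_loss, validation_tol, min_scale=0.01):
    """Return True when the validation improvement is below the tolerance.

    The threshold is relative (validation_tol * current_loss) for large
    losses and effectively absolute (validation_tol * min_scale) otherwise.
    """
    threshold = validation_tol * max(current_loss, min_scale)
    improvement = best_validation_error - current_validation_error
    return improvement < threshold


# Large loss: threshold is 0.01 * 10.0 = 0.1, improvement is only 0.001.
stop_large_loss = should_stop(1.0, 0.999, current_loss=10.0, validation_tol=0.01)
# Tiny loss: threshold is 0.01 * 0.01 = 0.0001, improvement is 0.0005.
stop_small_loss = should_stop(0.002, 0.0015, current_loss=0.001, validation_tol=0.01)
```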
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8549 from yanboliang/spark-7770.
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
With large feature spaces, the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
Author: Nathan Howell <nhowell@godaddy.com>
Closes #8246 from NathanHowell/SPARK-10064.
Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code.
Author: DB Tsai <dbt@netflix.com>
Closes #8853 from dbtsai/refactoring.
Provide initialModel param for pyspark.mllib.clustering.KMeans
Author: Evan Chen <chene@us.ibm.com>
Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
It is currently impossible to clear Param values once set. It would be helpful to be able to.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).
I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5779 from yinxusen/SPARK-5890.
For some implicit datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8937 from yanboliang/spark-10736.
I implemented `toString` for `AssociationRules.Rule`, with a format like `[x, y] => {z}: 1.0`.
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes #8904 from y-shimizu/master.
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but they also need a way to disable checkpointing. This PR lets users disable checkpointing by setting ```checkpointInterval = -1```.
We also add documentation for GBT ```cacheNodeIds``` so that users can understand checkpointing more clearly.
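The intended semantics of the sentinel value can be sketched as a small predicate. This is plain Python for illustration only; the function name and the choice to reject other non-positive intervals are assumptions.

```python
def should_checkpoint(iteration, checkpoint_interval):
    """Decide whether to checkpoint at this (1-based) iteration.

    A checkpoint_interval of -1 disables checkpointing entirely.
    """
    if checkpoint_interval == -1:
        return False
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be -1 or >= 1")
    return iteration % checkpoint_interval == 0


# With an interval of 5, iterations 5 and 10 are checkpointed.
checkpointed = [it for it in range(1, 11) if should_checkpoint(it, 5)]
# With -1, nothing is checkpointed.
disabled = [it for it in range(1, 11) if should_checkpoint(it, -1)]
```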
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8820 from yanboliang/spark-10699.
By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8836 from yanboliang/spark-10686.
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #8675 from sethah/SPARK-9715.
Currently, when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws IllegalArgumentException with an incomprehensible error message.
Take ```VectorSlicer.setNames``` as an example:
```scala
import org.apache.spark.ml.feature.VectorSlicer

val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
// The names passed to setNames must be distinct, so the next line throws an exception.
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
```
It will throw IllegalArgumentException as:
```
vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
```
We should distinguish array-typed values from primitive-typed values in Param.validate(value: T), so that we get a better error message:
```
vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8863 from yanboliang/spark-10750.
NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting to account for over- or undersampling.
work in progress.
Author: Meihua Wu <meihuawu@umich.edu>
Closes #8631 from rotationsymmetry/SPARK-9642.
SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
There is a duplicate assignment of the initialization flag in `WeightedLeastSquares#add`.
`initialized` is already set in `init(Int)`.
Author: lewuathe <lewuathe@me.com>
Closes #8837 from Lewuathe/duplicate-initialization-flag.
Note methods that fail for cols > 65535; note that SVD does not require n >= m
CC mengxr
Author: Sean Owen <sowen@cloudera.com>
Closes #8839 from srowen/SPARK-5905.
This makes equality test failures much more readable.
mengxr
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #8826 from ericl/attrgroupstr.
```GBTParams``` currently has ```stepSize``` as the learning rate.
ML has the shared param class ```HasStepSize```; ```GBTParams``` can extend it rather than duplicating the implementation.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8552 from yanboliang/spark-10394.
Should be the same as SPARK-7808 but use Java for the code example.
It would be great to add package doc for `spark.ml.feature`.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.
In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction that lets users over- or undersample the samples of each class according to given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over- or undersampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data may be more important than the rest: for example, training samples from tenured users may matter more than training samples from new users. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight them differently in the learning algorithm.
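The claim that weighting the loss and gradient is equivalent to oversampling can be checked on a toy problem. The sketch below is plain Python with a single scalar feature and no intercept; the function name and the `(label, weight, x)` tuple layout are assumptions for this example, not the `LabeledPoint` API.

```python
import math


def weighted_log_loss_and_grad(data, coef):
    """Logistic loss and gradient for (label, weight, x) tuples, label in {0, 1}."""
    loss, grad = 0.0, 0.0
    for label, weight, x in data:
        p = 1.0 / (1.0 + math.exp(-coef * x))
        # The instance weight multiplies both the loss and the gradient terms.
        loss += -weight * (label * math.log(p) + (1 - label) * math.log(1.0 - p))
        grad += weight * (p - label) * x
    return loss, grad


# One positive sample with weight 3 ...
weighted = [(1, 3.0, 2.0), (0, 1.0, -1.0)]
# ... behaves like the same sample repeated three times with weight 1.
oversampled = [(1, 1.0, 2.0)] * 3 + [(0, 1.0, -1.0)]

loss_w, grad_w = weighted_log_loss_and_grad(weighted, coef=0.3)
loss_o, grad_o = weighted_log_loss_and_grad(oversampled, coef=0.3)
```

Both calls produce the same loss and gradient up to floating-point rounding, but the weighted version touches each distinct sample only once.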
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
Closes #7884 from dbtsai/SPARK-7685.
This change does two things:
- tags a few tests and adds the mechanism in the build to be able to disable those tags,
  both in maven and sbt, for both junit and scalatest suites.
- adds some logic to run-tests.py to disable some tags depending on what files have
  changed; that's used to disable expensive tests when a module hasn't explicitly
  been changed, to speed up testing for changes that don't directly affect those
  modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8437 from vanzin/test-tags.
jira: https://issues.apache.org/jira/browse/SPARK-10491
We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
Let me know if a new unit test is needed.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8663 from hhbyyh/movedspr.
Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #8751 from pnpritchard/SPARK-10573.
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8457 from yanboliang/spark-10194.
A few Identifiable types overrode their toString method without using the parent implementation. As a consequence, the uid was no longer present in the toString result, although including it is the default behaviour.
This patch is a quick fix. The question of enforcement is still up.
No tests have been written to verify the toString method behaviour. That would take a long time because all types should be tested, not only those that currently have a regression.
It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira).
Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
Closes #8062 from BertrandDechoux/SPARK-9720.
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
We should document options in the public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation in the package doc. However, since there would then be no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc in `DefaultSource` instead. There are several minor updates in this PR:
1. Do `vectorType == "sparse"` only once.
2. Update `hashCode` and `equals`.
3. Remove inherited doc.
4. Delete temp dir in `afterAll`.
Lewuathe
Author: Xiangrui Meng <meng@databricks.com>
Closes #8699 from mengxr/SPARK-10537.
"checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API. We should unify them.
```
member of DecisionTreeParams <-> Scala API
shared param for all ML Transformer/Estimator <-> Python API
```
Proposal:
"checkpointInterval" is also used by ALS, so we make it a shared param in Scala.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8528 from yanboliang/spark-10023.
It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.
Two options are implemented.
* `numFeatures`: Specify the dimension of features vector
* `featuresType`: Specify the type of output vector. `sparse` is default.
Author: lewuathe <lewuathe@me.com>
Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:
986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations.
The remainder are some potential bugs and deprecated syntax.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #8433 from skyluc/issue/sbt-2.11.
Add WeibullGenerator for RandomDataGenerator.
#8611 needs to use WeibullGenerator to generate random data based on the Weibull distribution.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8622 from yanboliang/spark-10464.
The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.
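The normal equation approach can be illustrated on the smallest possible case: one feature plus an intercept. The sketch below is plain Python and only an assumption-laden illustration (the function name and the closed-form 2x2 solve are chosen for this example); the weighted sums are the kind of statistics that can be aggregated in one pass over the data and then solved locally.

```python
def weighted_least_squares(xs, ys, ws):
    """Fit y ~ intercept + slope * x by minimizing sum(w * residual^2)."""
    # Sufficient statistics: weighted sums that form the normal equations.
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))

    # Solve [[sw, swx], [swx, swxx]] * [intercept, slope] = [swy, swxy].
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("normal equations are singular")
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept, slope


# Points lying exactly on y = 1 + 2x are recovered regardless of the weights.
intercept, slope = weighted_least_squares(
    xs=[0.0, 1.0, 2.0, 3.0], ys=[1.0, 3.0, 5.0, 7.0], ws=[1.0, 2.0, 1.0, 0.5]
)
```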
There are a couple of TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc
It would be nice to have this merged first because it blocks a couple of other features.
dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8588 from mengxr/SPARK-9834.
Loader.checkSchema was called to verify the schema after dataframe.select(...).
Schema verification should be done before dataframe.select(...)
Author: Vinod K C <vinod.kc@huawei.com>
Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent.
This PR fixes it and adds a test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8637 from yanboliang/spark-10470.
This PR fixes two model ```copy()``` related issues:
[SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480)
```ML.LinearRegressionModel.copy()``` ignored the argument ```extra```, so setting it had no effect.
[SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479)
```ML.LogisticRegressionModel.copy()``` should copy model summary if available.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8641 from yanboliang/linear-regression-copy.
From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
We should make sure the scaladoc for params includes their default values through the models in ml/
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
Currently OneVsRest uses a UDF to generate the new binary label during training.
Considering that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which will be more efficient.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8519 from yanboliang/spark-10349.
This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8182 from mengxr/SPARK-9954.
* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save an object per instance
Further improvements will be addressed in SPARK-10329
cc: yu-iskw HuJiayin
Author: Xiangrui Meng <meng@databricks.com>
Closes #8526 from mengxr/SPARK-10354.
* Adds user guide for ml.feature.StopWordsRemover, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes #8436 from feynmanliang/SPARK-10230.
`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.
The proposed solution is to always cache the dataset and remove the warning. There's a caveat though: the input dataset gets evaluated twice, once in line 270 when fitting `StandardScaler` and again when running the optimizer. So it might be worth restoring the removed warning.
Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes #8395 from SlavikBaranov/SPARK-10182.
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Uses `commons.lang.StringUtils` instead of `com.google.collections.Strings`
* Uses the `List` interface instead of `ArrayList` implementations
This PR, along with #8445, #8446, and #8447, completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
Author: Feynman Liang <fliang@databricks.com>
Closes #8451 from feynmanliang/SPARK-10257.
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases.
Build for 2.10:
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
and 2.11:
./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
Author: Jacek Laskowski <jacek@japila.pl>
Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`
Author: Feynman Liang <fliang@databricks.com>
Closes #8445 from feynmanliang/SPARK-10254.
I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs.
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8452 from mengxr/SPARK-9665.
Same as #8421 but for `mllib.feature`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:
0e8d658 [Xiangrui Meng] remove unnecessary comment
ad70b03 [Xiangrui Meng] update since versions in mllib.feature
Same as #8421 but for `mllib.regression`.
cc freeman-lab dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:
6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
The same as #8241 but for `mllib.stat` and `mllib.random`.
cc feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8439 from mengxr/SPARK-10242.
Same as #8421 but for `mllib.linalg`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
b38437e [Xiangrui Meng] update since versions in mllib.linalg
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes #8254 from feynmanliang/SPARK-9888.
Same as #8421 but for `mllib.pmml` and `mllib.util`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
Author: Feynman Liang <fliang@databricks.com>
Closes #8424 from feynmanliang/SPARK-9797.
* Adds doc for alias of runMiniBatchSGD documenting default value for convergenceTol
* Cleans up a note in code
Author: Feynman Liang <fliang@databricks.com>
Closes #8425 from feynmanliang/SPARK-9800.
Update `Since` annotation in `mllib.classification`:
1. add version to classes, objects, constructors, and public variables declared in constructors
2. correct some versions
3. remove `Since` on `toString`
MechCoder dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.
This PR adds a unit test which checks this. It failed previously but works with this fix.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8370 from jkbradley/gmm-fix.
Add user guide for `VectorSlicer`, with a Java test suite and a Python version of VectorSlicer.
Note that the Python version does not yet support selecting by names.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #8267 from yinxusen/SPARK-9893.
Removed categorical feature info validation since no longer needed
This is needed to make the ML user guide examples work (in another current PR).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8367 from jkbradley/gbt-single-cat.
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8329 from jkbradley/lda-topic-assignments.
This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see 72fdeb6463). MechCoder
Closes #8256
Author: Xiangrui Meng <meng@databricks.com>
Author: Xiaoqing Wang <spark445@126.com>
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8288 from mengxr/SPARK-8918.
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
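How a caller might use such a flag during model selection can be sketched as follows. This is plain Python for illustration; `select_best` is a hypothetical helper, not part of `CrossValidator` or `TrainValidationSplit`.

```python
def select_best(metrics, is_larger_better):
    """Return the index of the best metric, maximizing or minimizing as told."""
    pick = max if is_larger_better else min
    return pick(range(len(metrics)), key=lambda i: metrics[i])


# An accuracy-like metric: larger is better, so index 1 wins.
best_by_accuracy = select_best([0.7, 0.9, 0.8], is_larger_better=True)
# An error-like metric such as RMSE: smaller is better, so index 0 wins.
best_by_rmse = select_best([1.2, 2.5, 3.0], is_larger_better=False)
```

With the flag in place, error metrics no longer need to be negated, so the values reported to the user keep their natural sign.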
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8290 from feynmanliang/SPARK-10097.
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
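A minimal plain-Python sketch of corpus-level vocabulary extraction is shown below. It assumes that the filter counts total occurrences of a term across all documents and that ties are broken alphabetically; both choices, and the names `fit_vocabulary`, `vocab_size` and `min_count`, are assumptions for this example rather than the estimator's actual behavior.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_count=1):
    """Pick the most frequent terms across the whole corpus.

    docs is a list of token lists. Terms seen fewer than min_count times
    in the corpus are dropped before the top vocab_size terms are kept.
    """
    counts = Counter(term for doc in docs for term in doc)
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    # Most frequent first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return [term for term, _ in kept[:vocab_size]]


docs = [["a", "b", "a"], ["b", "c"], ["a"]]
vocab = fit_vocabulary(docs, vocab_size=10, min_count=2)
```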
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users cannot currently get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8263 from yanboliang/mlp-public.
This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8260 from mengxr/SPARK-7808.
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
Also added unit test for integration between StringIndexerModel and IndexToString
CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8211 from jkbradley/stridx-labels.
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8159 from cloud-fan/withColumn.
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.
As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.
Targeted for 1.5 and master
CC: manishamde mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8187 from jkbradley/tree-1cat.
Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8190 from mengxr/SPARK-9661-fix.
What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name does not clearly describe the transformation. Renaming to `IndexToString` might be better.
~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.
jkbradley holdenk
Author: Xiangrui Meng <meng@databricks.com>
Closes#8152 from mengxr/SPARK-9922.
I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same.
1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8126 from MechCoder/java_incop.
To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8164 from yanboliang/mlp-name.
Copied ML models must have the same parent as the original ones.
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes#7447 from Lewuathe/SPARK-9073.
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
jkbradley yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes#8148 from mengxr/SPARK-9918 and squashes the following commits:
149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
The problem with defining setters in the base class is that they don't return the correct type in Java.
ericl
Author: Xiangrui Meng <meng@databricks.com>
Closes#8143 from mengxr/SPARK-9914 and squashes the following commits:
d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8136 from mengxr/SPARK-9903.
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.
CC: mengxr EronWright
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8004 from jkbradley/ml-api-public-items and squashes the following commits:
7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
As per the TODO move weightCol to Shared Params.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set.
CC: mengxr dbtsai feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8079 from jkbradley/logreg-reinstate-threshold.
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.
This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
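The difference can be sketched with a toy parameter holder. This is an illustration only, not the Spark `Params` API: the class, the attribute names `default_param_map` and `param_map`, and the `copy_values` helper are hypothetical stand-ins for the behaviour described above.

```python
class Params:
    """Toy parameter holder with separate default and user-set maps."""

    def __init__(self):
        self.default_param_map = {}  # library defaults
        self.param_map = {}          # values the user set explicitly

    def is_set(self, name):
        return name in self.param_map

    def get(self, name):
        if name in self.param_map:
            return self.param_map[name]
        return self.default_param_map[name]


def copy_values(source, target):
    """Copy defaults to defaults and explicit values to explicit values."""
    target.default_param_map.update(source.default_param_map)
    target.param_map.update(source.param_map)
    return target


estimator = Params()
estimator.default_param_map["threshold"] = 0.5
estimator.param_map["thresholds"] = [0.3, 0.7]

model = copy_values(estimator, Params())
# The default stays a default, so only `thresholds` counts as explicitly set.
print(model.is_set("threshold"), model.is_set("thresholds"))
```

Copying the default of `threshold` into `param_map` instead would make it look user-set and put it in conflict with `thresholds`.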
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8115 from jkbradley/copyvalues-fix.
Went through the change history of the file MLUtils.scala and picked up the version in which each change went in.
Author: Sudhakar Thota <sudhakarthota@yahoo.com>
Author: Sudhakar Thota <sudhakarthota@sudhakars-mbp-2.usca.ibm.com>
Closes#7436 from sthota2014/SPARK-8925_thotas.
1. Add `asymmetricDocConcentration` and revert docConcentration changes. If the (internal) doc concentration vector is a single value, `getDocConcentration` returns it. If it is a constant vector, `getDocConcentration` returns the first item; otherwise it fails.
2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors.
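The getter rule in item 1 can be sketched as follows. This is a plain-Python illustration under the assumption that the internal vector is available as a list; the function name and the error type are hypothetical, not the LDA API.

```python
def get_doc_concentration(doc_concentration):
    """Return the symmetric concentration value, or fail if it is asymmetric."""
    values = list(doc_concentration)
    if not values:
        raise ValueError("docConcentration is empty")
    first = values[0]
    if any(v != first for v in values):
        # A non-constant vector has no single representative value.
        raise ValueError("docConcentration is asymmetric; use the asymmetric getter")
    return first


print(get_doc_concentration([0.1]))
print(get_doc_concentration([0.5, 0.5, 0.5]))
```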
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes#8077 from feynmanliang/SPARK-9788 and squashes the following commits:
6b07bc8 [Feynman Liang] Code review changes
9d6a71e [Feynman Liang] Add asymmetricAlpha alias
bf4e685 [Feynman Liang] Asymmetric docConcentration
4cab972 [Feynman Liang] Default gammaShape
Adds unit test for `equals` on `mllib.linalg.Matrix` class and `equals` to both `SparseMatrix` and `DenseMatrix`. Supports equality testing between `SparseMatrix` and `DenseMatrix`.
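A rough sketch of equality based on matrix semantics rather than representation, assuming a column-major dense layout and a CSC sparse layout. The helper names are made up for this example, and the real implementation compares via Breeze rather than Python dicts.

```python
def dense_entries(num_rows, values):
    """Non-zero entries of a column-major dense matrix as {(row, col): value}."""
    return {
        (k % num_rows, k // num_rows): v
        for k, v in enumerate(values)
        if v != 0.0
    }


def sparse_entries(num_cols, col_ptrs, row_indices, values):
    """Non-zero entries of a CSC sparse matrix as {(row, col): value}."""
    entries = {}
    for col in range(num_cols):
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            if values[k] != 0.0:  # ignore explicitly stored zeros
                entries[(row_indices[k], col)] = values[k]
    return entries


def matrices_equal(shape_a, entries_a, shape_b, entries_b):
    """Two matrices are equal when shapes and non-zero entries match."""
    return shape_a == shape_b and entries_a == entries_b


# The 2x2 matrix [[1, 0], [0, 2]] in both representations.
dense = dense_entries(2, [1.0, 0.0, 0.0, 2.0])
sparse = sparse_entries(2, [0, 1, 2], [0, 1], [1.0, 2.0])
print(matrices_equal((2, 2), dense, (2, 2), sparse))
```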
mengxr
Author: Feynman Liang <fliang@databricks.com>
Closes#8042 from feynmanliang/SPARK-9750 and squashes the following commits:
bb70d5e [Feynman Liang] Breeze compare for dense matrices as well, in case other is sparse
ab6f3c8 [Feynman Liang] Sparse matrix compare for equals
22782df [Feynman Liang] Add equality based on matrix semantics, not representation
78f9426 [Feynman Liang] Add casts
43d28fa [Feynman Liang] Fix failing test
6416fa0 [Feynman Liang] Add failing sparse matrix equals tests
As a precursor to adding a public constructor, add an option to handle unseen values by skipping rather than throwing an exception (the default remains throwing an exception).
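For illustration, the two behaviours can be sketched in plain Python. The function name `index_labels` and the option values `"error"` and `"skip"` are assumptions chosen for this example; only the existence of a `handleInvalid` param comes from the commits below.

```python
def index_labels(labels, values, handle_invalid="error"):
    """Map each value to the index of its label; unseen values are skipped or rejected."""
    index = {label: float(i) for i, label in enumerate(labels)}
    result = []
    for value in values:
        if value in index:
            result.append(index[value])
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        else:
            raise ValueError("Unseen label: %r" % (value,))
    return result


# "c" was never seen when the index was built, so it is dropped.
print(index_labels(["a", "b"], ["a", "c", "b"], handle_invalid="skip"))
```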
Author: Holden Karau <holden@pigscanfly.ca>
Closes#7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits:
38a4de9 [Holden Karau] fix long line
045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure
81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive
7f37f6e [Holden Karau] remove extra space (scala style)
414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid
1e53f9b [Holden Karau] update the param (codegen side)
7a22215 [Holden Karau] fix typo
100a39b [Holden Karau] Merge in master
aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true
75ffa69 [Holden Karau] Remove extra newline
d69ef5e [Holden Karau] Add a test
b5734be [Holden Karau] Add support for unseen labels
afecd4e [Holden Karau] Add a param to skip invalid entries.
Implements a transformer whose transformation is defined by a SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__'
where '__THIS__' represents the underlying table of the input dataset.
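The placeholder mechanism can be sketched outside Spark with the standard-library `sqlite3` module. The helper `sql_transform` and the table name `input_data` are hypothetical; the sketch only shows substituting `__THIS__` with the name of the table that holds the input data.

```python
import sqlite3


def sql_transform(conn, statement, table_name):
    """Run `statement` after replacing the placeholder with a real table name."""
    return conn.execute(statement.replace("__THIS__", table_name)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_data (id INTEGER, v1 REAL, v2 REAL)")
conn.executemany(
    "INSERT INTO input_data VALUES (?, ?, ?)",
    [(0, 1.0, 3.0), (2, 2.0, 5.0)],
)

rows = sql_transform(
    conn, "SELECT id, v1 + v2 AS v3 FROM __THIS__ ORDER BY id", "input_data"
)
print(rows)
```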
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7465 from yanboliang/spark-8345 and squashes the following commits:
b403fcb [Yanbo Liang] address comments
0d4bb15 [Yanbo Liang] a better transformSchema() implementation
51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
Adds method documentation back to `MultivariateOnlineSummarizer`, which was present in 1.4 but disappeared somewhere along the way to 1.5.
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes#8045 from feynmanliang/SPARK-9755 and squashes the following commits:
af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
Small documentation cleanups, including:
* Adds documentation for `pi` and `theta`
* setParam to `setModelType`
Author: Feynman Liang <fliang@databricks.com>
Closes#8047 from feynmanliang/SPARK-9719 and squashes the following commits:
b372438 [Feynman Liang] Clean up naive bayes doc
These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors.
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes#8046 from feynmanliang/SPARK-9756 and squashes the following commits:
2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private
a06f596 [Feynman Liang] Make constructors in ML decision trees private
A minor typo (centriod -> centroid). Readable variable names help every user.
Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
Closes#8037 from BertrandDechoux/kmeans-typo and squashes the following commits:
47632fe [Bertrand Dechoux] centriod typo
Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs
CC: dkobylarz mengxr
To be merged with master and branch-1.5
Primary author: dkobylarz
Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
Closes#8039 from jkbradley/gmm-predict-vec and squashes the following commits:
bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector
This PR contains the following changes:
* add `featureIndex` to handle vector features (in order to chain isotonic regression easily with output from logistic regression)
* make getter/setter names consistent with params
* remove inheritance from Regressor because it is tricky to handle both `DoubleType` and `VectorType`
* simplify test data generation
jkbradley zapletal-martin
Author: Xiangrui Meng <meng@databricks.com>
Closes#7952 from mengxr/SPARK-9493 and squashes the following commits:
8818ac3 [Xiangrui Meng] address comments
05e2216 [Xiangrui Meng] address comments
8d08090 [Xiangrui Meng] add featureIndex to handle vector features make getter/setter names consistent with params remove inheritance from Regressor
After https://github.com/apache/spark/pull/7263 it is pretty straightforward to add Python wrappers.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#7930 from MechCoder/spark-9533 and squashes the following commits:
1bea394 [MechCoder] make getVectors a lazy val
5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary
I have some queries and asked them inline.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#7538 from MechCoder/log_reg_stats and squashes the following commits:
2e9f7c7 [MechCoder] Change defs into lazy vals
d775371 [MechCoder] Clean up class inheritance
9586125 [MechCoder] Add abstraction to handle Multiclass Metrics
40ad8ef [MechCoder] minor
640376a [MechCoder] remove unnecessary dataframe stuff and add docs
80d9954 [MechCoder] Added tests
fbed861 [MechCoder] DataFrame support for metrics
70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression
Add VectorSlicer transformer to spark.ml, with features specified as either indices or names. Transfers feature attributes for selected features.
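As a rough plain-Python sketch of slicing by indices or names (not the spark.ml implementation), where the feature names stand in for the feature attributes that are carried over. The function and parameter names are assumptions.

```python
def slice_vector(values, names, indices=(), selected_names=()):
    """Select features by position and/or by name, keeping their names."""
    name_to_index = {name: i for i, name in enumerate(names)}
    picked = list(indices) + [name_to_index[n] for n in selected_names]
    return [values[i] for i in picked], [names[i] for i in picked]


sliced_values, sliced_names = slice_vector(
    [1.0, 2.0, 3.0], ["a", "b", "c"], indices=[0], selected_names=["c"]
)
print(sliced_values, sliced_names)
```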
Updated version of [https://github.com/apache/spark/pull/5731]
CC: yinxusen This updates your PR. You'll still be the primary author of this PR.
CC: mengxr
Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits:
b16e86e [Joseph K. Bradley] fixed scala style
71c65d2 [Joseph K. Bradley] fix import order
86e9739 [Joseph K. Bradley] cleanups per code review
9d8d6f1 [Joseph K. Bradley] style fix
83bc2e9 [Joseph K. Bradley] Updated VectorSlicer
98c6939 [Xusen Yin] fix style error
ecbf2d3 [Xusen Yin] change interfaces and params
f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895
e4781f2 [Xusen Yin] fix commit error
fd154d7 [Xusen Yin] add test suite of vector slicer
17171f8 [Xusen Yin] fix slicer
9ab9747 [Xusen Yin] add vector slicer
aa5a0bf [Xusen Yin] add vector slicer
mengxr
Author: Feynman Liang <fliang@databricks.com>
Closes#7974 from feynmanliang/SPARK-9657 and squashes the following commits:
7ca533f [Feynman Liang] Fix return type of getMaxPatternLength
mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:
27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
c014002 [Mike Dusenberry] Using properties for better documentation.
3bda6ab [Mike Dusenberry] Adding documentation.
8fb3095 [Mike Dusenberry] Small cleanup.
e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
This is a major refactoring of the PrefixSpan implementation. It contains the following changes:
1. Expand prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issues when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`.
3. Remember the start indices of all partial projections in the projected postfix to help next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than its content.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
8. Tests are not changed except using `0` instead of `-1` as the delimiter.
`Postfix`'s API doc should be a good place to start.
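The internal format from item 2 can be reproduced with a small encoder. This is a sketch assuming items are positive integers and that items within an itemset are stored in sorted order, which is what the `<(12)(31)>` example suggests; the function name is hypothetical.

```python
def encode_sequence(sequence, delimiter=0):
    """Flatten a sequence of itemsets into one array with `delimiter` between itemsets."""
    encoded = [delimiter]
    for itemset in sequence:
        encoded.extend(sorted(itemset))
        encoded.append(delimiter)
    return encoded


# <(12)(31)> becomes [0, 1, 2, 0, 1, 3, 0]
print(encode_sequence([[1, 2], [3, 1]]))
```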
Closes#7594
feynmanliang zhangjiajin
Author: Xiangrui Meng <meng@databricks.com>
Closes#7937 from mengxr/SPARK-9540 and squashes the following commits:
2d0ec31 [Xiangrui Meng] address more comments
48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test
65f90e8 [Xiangrui Meng] naming and documentation
8afc86a [Xiangrui Meng] refactor impl
All compressed sensing applications, and some regression use cases, get better results with feature scaling turned off. However, if we implement this naively by training on the dataset without any standardization, the rate of convergence will be poor. Instead, we can still standardize the training dataset but penalize each component differently, which gives effectively the same objective function but a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this, all features are standardized and are therefore penalized equally.
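To make the reweighting concrete, here is a small numeric sketch for an L2 penalty. It assumes features are scaled as x'_j = x_j / sigma_j, so the coefficient in the standardized problem is w'_j = w_j * sigma_j; the function name and the 0.5 factor in the penalty are illustrative choices, not taken from the implementation.

```python
def l2_penalty_in_standardized_space(std_coefficients, feature_std, reg_param):
    """L2 penalty on original-scale coefficients, written with standardized ones.

    Since w_j = w'_j / sigma_j, each standardized coefficient is weighted by
    1 / sigma_j ** 2, so high-variance columns are penalized less.
    """
    return 0.5 * reg_param * sum(
        (w / s) ** 2 for w, s in zip(std_coefficients, feature_std)
    )


original = [2.0, 0.5]
sigma = [10.0, 0.1]
standardized = [w * s for w, s in zip(original, sigma)]

penalty_original = 0.5 * 0.3 * sum(w ** 2 for w in original)
penalty_reweighted = l2_penalty_in_standardized_space(standardized, sigma, 0.3)
print(penalty_original, penalty_reweighted)
```

Both values agree up to floating-point rounding, which is the sense in which the objective function stays effectively the same.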
In R, there is an option for this.
standardize
Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
Note that the primary author for this PR is holdenk
Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>
Closes#7875 from dbtsai/SPARK-8522 and squashes the following commits:
e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparision operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl
Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol.
I don't think setScoreCol was actually used anywhere (based on search).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7921 from jkbradley/binary-eval-rawpred and squashes the following commits:
e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the and parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
Small cleanups to recent LDA additions and docs.
CC: feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7916 from jkbradley/lda-cleanups and squashes the following commits:
f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning
97947aa [Joseph K. Bradley] a few more cleanups
5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood
c566915 [Joseph K. Bradley] small edit to make review easier
63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
This PR replaces the old "threshold" with a generalized "thresholds" Param. We keep getThreshold,setThreshold for backwards compatibility for binary classification.
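One way to read a per-class `thresholds` array is as a scaling of the class scores before taking the argmax. The sketch below assumes the rule "predict the class with the largest p / t"; the commits that follow mention using thresholds to scale scores, but the exact rule here is an assumption for illustration.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest probability/threshold ratio."""
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)


# Equal thresholds reduce to a plain argmax.
print(predict_with_thresholds([0.6, 0.4], [0.5, 0.5]))
# A low threshold for class 1 makes it win despite the lower probability.
print(predict_with_thresholds([0.6, 0.4], [0.8, 0.2]))
```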
Note that the primary author of this PR is holdenk
Author: Holden Karau <holden@pigscanfly.ca>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:
3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
I'll explain several of the changes inline in comments.
Author: Sean Owen <sowen@cloudera.com>
Closes#7862 from srowen/SPARK-9534 and squashes the following commits:
ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
Add missing methods
1. getVectors
2. findSynonyms
to W2Vec scala and python API
mengxr
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#7263 from MechCoder/missing_methods_w2vec and squashes the following commits:
149d5ca [MechCoder] minor doc
69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec
Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder
Author: Xiangrui Meng <meng@databricks.com>
Closes#7879 from mengxr/SPARK-9544 and squashes the following commits:
3d5ff03 [Xiangrui Meng] add an doctest for . and -
5e969a5 [Xiangrui Meng] fix pydoc
1cd41f8 [Xiangrui Meng] organize imports
3c18b10 [Xiangrui Meng] add Python API for RFormula
Added featureImportance to RandomForestClassifier and Regressor.
This follows the scikit-learn implementation here: [a95203b249/sklearn/tree/_tree.pyx (L3341)]
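For readers unfamiliar with that approach, a simplified single-tree sketch: each split contributes its weighted impurity decrease to the feature it splits on, and the totals are normalized to sum to 1. The dict-based node layout is made up for this example and is not the Spark or scikit-learn data structure.

```python
def feature_importances(tree, num_features):
    """Sum weighted impurity decreases per feature and normalize them."""
    importances = [0.0] * num_features
    stack = [tree]
    while stack:
        node = stack.pop()
        if "feature" not in node:
            continue  # leaf nodes contribute nothing
        left, right = node["left"], node["right"]
        gain = (
            node["n"] * node["impurity"]
            - left["n"] * left["impurity"]
            - right["n"] * right["impurity"]
        )
        importances[node["feature"]] += gain
        stack.extend([left, right])
    total = sum(importances)
    return [v / total for v in importances] if total > 0 else importances


tree = {
    "feature": 0, "n": 10, "impurity": 0.5,
    "left": {"n": 4, "impurity": 0.0},
    "right": {
        "feature": 1, "n": 6, "impurity": 0.4,
        "left": {"n": 3, "impurity": 0.0},
        "right": {"n": 3, "impurity": 0.0},
    },
}
print(feature_importances(tree, num_features=2))
```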
CC: yanboliang Would you mind taking a look? Thanks!
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Feynman Liang <fliang@databricks.com>
Closes#7838 from jkbradley/dt-feature-importance and squashes the following commits:
72a167a [Joseph K. Bradley] fixed unit test
86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
33df5db [Joseph K. Bradley] fix unit test
42a2d3b [Joseph K. Bradley] fix unit test
fe94e72 [Joseph K. Bradley] modified feature importance unit tests
cc693ee [Feynman Liang] Add classifier tests
79a6f87 [Feynman Liang] Compare dense vectors in test
21d01fc [Feynman Liang] Added failing SKLearn test
ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor. Need to add unit tests
RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction.
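As a rough illustration of the idea, the sketch below sums per-tree class-probability vectors into a raw prediction and then normalizes it. It is plain Scala with no Spark dependency; the object and method names are hypothetical and are not the names used in this PR.

```scala
object RandomForestProbabilitySketch {
  // Sum the class-probability vectors of all trees into one raw prediction vector.
  // Assumes every tree vector has at least numClasses entries.
  def rawPrediction(treeProbabilities: Seq[Array[Double]], numClasses: Int): Array[Double] = {
    val raw = new Array[Double](numClasses)
    for (probs <- treeProbabilities; i <- 0 until numClasses) raw(i) += probs(i)
    raw
  }

  // Normalize the raw vector so that its entries sum to 1.
  // An all-zero vector is returned unchanged (as a copy) to avoid dividing by zero.
  def probability(raw: Array[Double]): Array[Double] = {
    val total = raw.sum
    if (total == 0.0) raw.clone() else raw.map(_ / total)
  }
}
```

For example, two trees voting `(1.0, 0.0)` and `(0.5, 0.5)` give a raw prediction of `(1.5, 0.5)` and a probability of `(0.75, 0.25)`.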
CC: holdenk
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7859 from jkbradley/rf-prob and squashes the following commits:
6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier
1. Use `PrefixSpanModel` to wrap the frequent sequences.
2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence`
3. Overload `run` for Java users.
4. Added a unit test in Java to check Java compatibility.
zhangjiajin feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#7869 from mengxr/SPARK-9527 and squashes the following commits:
4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly
mengxr Please review after #7818 merges and master is rebased.
Continues work by rikima
Closes#7400
Author: Feynman Liang <fliang@databricks.com>
Author: masaki rikitoku <rikima3132@gmail.com>
Closes#7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits:
8b2c756 [Feynman Liang] Remove orig
92443c8 [Feynman Liang] Style fixes
42c6349 [Feynman Liang] Style fix
14e67fc [Feynman Liang] Generic prefixSpan itemtypes
b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api
b86e0d5 [masaki rikitoku] modify to support generic item type
Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
cc jkbradley
Author: Meihua Wu <meihuawu@umich.edu>
Closes#7858 from rotationsymmetry/SPARK-9530 and squashes the following commits:
b574923 [Meihua Wu] Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
jira: https://issues.apache.org/jira/browse/SPARK-8169
stop words: http://en.wikipedia.org/wiki/Stop_words
StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.
Currently I use a minimal stop word set, since in some [cases](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) a small set of stop words is preferred.
ASCII chars have been tested, yet I cannot check it in due to the style check.
Further thoughts:
1. Maybe I should use OpenHashSet. Is it recommended?
2. Currently I leave nulls in the input array untouched, i.e. Array(null, null) => Array(null, null).
3. If the current stop word set looks too limited, any suggestions for a replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
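The filtering behavior described here can be sketched in plain Scala as follows. This is only an illustration: the object name, the function name, and the tiny default word list are hypothetical, and matching is assumed to be exact and case-sensitive.

```scala
object StopWordsSketch {
  // Hypothetical minimal default list, for illustration only.
  val defaultStopWords: Set[String] = Set("a", "an", "the", "of", "and")

  // Drop every token found in stopWords; null entries are passed through untouched.
  def removeStopWords(
      tokens: Seq[String],
      stopWords: Set[String] = defaultStopWords): Seq[String] =
    tokens.filter(t => t == null || !stopWords.contains(t))
}
```

With the default list, `Seq("the", "quick", null, "fox")` becomes `Seq("quick", null, "fox")`.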
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#6742 from hhbyyh/stopwords and squashes the following commits:
fa959d8 [Yuhao Yang] separating udf
f190217 [Yuhao Yang] replace default list and other small fix
04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords
b3aa957 [Yuhao Yang] add stopWordsRemover
mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin
* Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA
Closes#7646
Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>
Closes#7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits:
4ded81d [Feynman Liang] Replace all filters to filter nonempty
350e67e [Feynman Liang] Code review feedback
03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences
d1fe0ed [Feynman Liang] Remove comments
86ca4e5 [Feynman Liang] Fix style
7c7bf39 [Feynman Liang] Fixed itemSet sequences
6073b10 [Feynman Liang] Basic itemset functionality, failing test
1a7fb48 [Feynman Liang] Add delimiter to results
5db00aa [Feynman Liang] Working for items, not itemsets
6787716 [Feynman Liang] Working on temporal sequences
f1114b9 [Feynman Liang] Add -1 delimiter
00fe756 [Feynman Liang] Reset base files for rebase
f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items).
60a0b76 [zhangjiajin] fixed a scala style error.
740c203 [zhangjiajin] fixed a scala style error.
5785cb8 [zhangjiajin] support non-temporal sequence
a5d649d [zhangjiajin] restore original version
09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2
ae8c02d [zhangjiajin] Fixed some Scala style errors.
216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan
b572f54 [zhangjiajin] initialize file before rebase.
f06772f [zhangjiajin] fix a scala style error.
a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala
d9d8137 [zhang jiajin] Delete Prefixspan.scala
c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file.
It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.
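A minimal sketch of such an inverse mapping, assuming the labels are held in an array whose positions are the encoded indices. The names below are hypothetical and the code does not use the Spark API.

```scala
object IndexToLabelSketch {
  // Forward direction: map each string to its position in the labels array.
  def index(labels: Array[String], values: Seq[String]): Seq[Double] = {
    val lookup = labels.zipWithIndex.toMap
    values.map(v => lookup(v).toDouble)
  }

  // Inverse direction: map each encoded index back to its original string.
  def invert(labels: Array[String], indices: Seq[Double]): Seq[String] =
    indices.map(i => labels(i.toInt))
}
```

Applying `invert` to the output of `index` with the same labels array returns the original strings.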
Author: Holden Karau <holden@pigscanfly.ca>
Closes#6339 from holdenk/SPARK-7446-inverse-transform-for-string-indexer and squashes the following commits:
7cdf915 [Holden Karau] scala style comment fix
b9cffb6 [Holden Karau] Update the labels param to have the metadata note
6a38edb [Holden Karau] Setting the default needs to come after the value gets defined
9e241d8 [Holden Karau] use Array.empty
21c8cfa [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
64dd3a3 [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
4f06c59 [Holden Karau] Fix comment styles, use empty array as the default, etc.
a60c0e3 [Holden Karau] CR feedback (remove old constructor, add a note about use of setLabels)
1987b95 [Holden Karau] Use default copy
71e8d66 [Holden Karau] Make labels a local param for StringIndexerInverse
8450d0b [Holden Karau] Use the labels param in StringIndexerInverse
7464019 [Holden Karau] Add a labels param
868b1a9 [Holden Karau] Update scaladoc since we don't have labelsCol anymore
5aa38bf [Holden Karau] Add an inverse test using only meta data, pass labels when calling inverse method
f3e0c64 [Holden Karau] CR feedback
ebed932 [Holden Karau] Add Experimental tag and some scaladocs. Also don't require that the inputCol has the metadata on it, instead have the labelsCol specified when creating the inverse.
03ebf95 [Holden Karau] Add explicit type for invert function
ecc65e0 [Holden Karau] Read the metadata correctly, use the array, pass the test
a42d773 [Holden Karau] Fix test to supply cols as per new invert method
16cc3c3 [Holden Karau] Add an invert method
d4bcb20 [Holden Karau] Make the inverse string indexer into a transformer (still needs test updates but compiles)
e8bf3ad [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
c3fdee1 [Holden Karau] Some WIP refactoring based on jkbradley's CR feedback. Definite work-in-progress
557bef8 [Holden Karau] Instead of using a private inverse transform, add an invert function so we can use it in a pipeline
88779c1 [Holden Karau] fix long line
78b28c1 [Holden Karau] Finish reverse part and add a test :)
bb16a6a [Holden Karau] Some progress
This PR adds `MapData` as the internal representation of the map type in Spark SQL, and provides a default implementation backed by just 2 `ArrayData`.
After that, we have specialized getters for all internal types, so I removed the generic getter in `ArrayData` and added a specialized `toArray` for it.
Also did some refactoring and cleanup for `InternalRow` and its subclasses.
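The idea of backing a map with two arrays can be sketched as below. This is a simplified, hypothetical class in plain Scala; it is not the `MapData` or `ArrayData` API added by this PR.

```scala
object MapDataSketch {
  // A map stored as two parallel sequences: keys(i) maps to values(i).
  final class ArrayBackedMap[K, V](val keys: IndexedSeq[K], val values: IndexedSeq[V]) {
    require(keys.length == values.length, "keys and values must have the same length")

    def numElements: Int = keys.length

    // Linear scan over the key sequence; returns the value of the first matching key.
    def get(key: K): Option[V] = {
      val i = keys.indexOf(key)
      if (i >= 0) Some(values(i)) else None
    }
  }
}
```

Keeping keys and values in separate arrays means each side can be read through the same array accessors, at the cost of a linear lookup in this sketch.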
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7799 from cloud-fan/map-data and squashes the following commits:
77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData
Adds `alpha` (document-topic Dirichlet parameter) hyperparameter optimization to `OnlineLDAOptimizer` following Huang: Maximum Likelihood Estimation of Dirichlet Distribution Parameters. Also introduces a private `setSampleWithReplacement` to `OnlineLDAOptimizer` for unit testing purposes.
Author: Feynman Liang <fliang@databricks.com>
Closes#7836 from feynmanliang/SPARK-8936-alpha-optimize and squashes the following commits:
4bef484 [Feynman Liang] Documentation improvements
c3c6c1d [Feynman Liang] Fix docs
151e859 [Feynman Liang] Fix style
fa77518 [Feynman Liang] Hyperparameter optimization
Make NaiveBayesModel support predicting class probabilities by inheriting from ProbabilisticClassificationModel.
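One common way to turn raw scores into class probabilities, assuming the raw scores are unnormalized log-probabilities, is to exponentiate and normalize after subtracting the maximum for numerical stability. The sketch below shows that technique in plain Scala; the names are hypothetical and it is not necessarily how this PR implements it.

```scala
object LogScoreNormalizationSketch {
  // Convert unnormalized log-scores into probabilities that sum to 1.
  // Subtracting the maximum first keeps exp() from overflowing or underflowing to all zeros.
  // Assumes logScores is non-empty.
  def toProbabilities(logScores: Array[Double]): Array[Double] = {
    val maxLog = logScores.max
    val shifted = logScores.map(s => math.exp(s - maxLog))
    val total = shifted.sum
    shifted.map(_ / total)
  }
}
```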
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7672 from yanboliang/spark-9308 and squashes the following commits:
25e224c [Yanbo Liang] raw2probabilityInPlace should operate in-place
3ee56d6 [Yanbo Liang] change predictRaw and raw2probabilityInPlace
c07e7a2 [Yanbo Liang] ml.NaiveBayesModel support predicting class probabilities
Add topDocumentsPerTopic to DistributedLDAModel.
Add ScalaDoc and unit tests.
Author: Meihua Wu <meihuawu@umich.edu>
Closes#7769 from rotationsymmetry/SPARK-9246 and squashes the following commits:
1029e79c [Meihua Wu] clean up code comments
a023b82 [Meihua Wu] Update tests to use Long for doc index.
91e5998 [Meihua Wu] Use Long for doc index.
b9f70cf [Meihua Wu] Revise topDocumentsPerTopic
26ff3f6 [Meihua Wu] Add topDocumentsPerTopic, scala doc and unit tests
jkbradley Exposes `bound` (variational log likelihood bound) through public API as `logLikelihood`. Also adds unit tests, some DRYing of `LDASuite`, and includes unit tests mentioned in #7760
Author: Feynman Liang <fliang@databricks.com>
Closes#7801 from feynmanliang/SPARK-9481-logLikelihood and squashes the following commits:
6d1b2c9 [Feynman Liang] Negate perplexity definition
5f62b20 [Feynman Liang] Add logLikelihood
Decision trees now support predicting class probabilities.
Implement the probability prediction function with reference to the old DecisionTree API and the [sklearn API](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L593).
I made DecisionTreeClassificationModel inherit from ProbabilisticClassificationModel, made predictRaw return the raw counts vector, and made raw2probabilityInPlace/predictProbability return the probabilities for each prediction.
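A minimal sketch of converting a raw counts vector into probabilities in place, including a guard for the case where the counts sum to zero. The object and method names are hypothetical, and the zero-sum behavior shown (leave the vector unchanged) is an assumption rather than the behavior of this PR.

```scala
object CountsToProbabilitySketch {
  // Normalize class counts in place so that they sum to 1, and return the same array.
  // An all-zero vector is left untouched to avoid a 0/0 division.
  def normalizeInPlace(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total != 0.0) {
      var i = 0
      while (i < counts.length) {
        counts(i) /= total
        i += 1
      }
    }
    counts
  }
}
```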
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7694 from yanboliang/spark-6885 and squashes the following commits:
08d5b7f [Yanbo Liang] fix ImpurityStats null parameters and raw2probabilityInPlace sum = 0 issue
2174278 [Yanbo Liang] solve merge conflicts
7e90ba8 [Yanbo Liang] fix typos
33ae183 [Yanbo Liang] fix annotation
ff043d3 [Yanbo Liang] raw2probabilityInPlace should operate in-place
c32d6ce [Yanbo Liang] optimize calculateImpurityStats function again
6167fb0 [Yanbo Liang] optimize calculateImpurityStats function
fbbe2ec [Yanbo Liang] eliminate duplicated struct and code
beb1634 [Yanbo Liang] try to eliminate impurityStats for each LearningNode
99e8943 [Yanbo Liang] code optimization
5ec3323 [Yanbo Liang] implement InformationGainAndImpurityStats
227c91b [Yanbo Liang] refactor LearningNode to store ImpurityCalculator
d746ffc [Yanbo Liang] decision tree support predict class probabilities
jira: https://issues.apache.org/jira/browse/SPARK-9231
Helper method in DistributedLDAModel of this form:
```
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
```
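For a single document, the top-k selection could look like the sketch below. It works on a plain array of topic weights rather than an RDD, and the object and function names are hypothetical.

```scala
object TopTopicsSketch {
  // Return the indices and weights of the k highest-weighted topics, heaviest first.
  // If k exceeds the number of topics, all topics are returned.
  def topTopics(topicWeights: Array[Double], k: Int): (Array[Int], Array[Double]) = {
    val top = topicWeights.zipWithIndex.sortBy { case (weight, _) => -weight }.take(k)
    (top.map(_._2), top.map(_._1))
  }
}
```

For weights `(0.1, 0.6, 0.3)` and `k = 2`, this yields indices `(1, 2)` and weights `(0.6, 0.3)`.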
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#7785 from hhbyyh/topTopicsPerdoc and squashes the following commits:
30ad153 [Yuhao Yang] small fix
fd24580 [Yuhao Yang] add topTopics per document to DistributedLDAModel
This pull request contains the following feature for ML:
- Multilayer Perceptron classifier
This implementation is based on our initial pull request with bgreeven: https://github.com/apache/spark/pull/1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
- Extensible interface, so it will be easy to implement new types of networks
- The main building blocks are the traits `Layer` and `LayerModel`. They are used for constructing the layers of an ANN. New layers can be added by extending the `Layer` and `LayerModel` traits. These traits are private in this release to leave room to improve them based on community feedback
- Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
- Speed and scalability: this implementation has to be comparable in terms of speed to state-of-the-art single-node implementations.
- The benchmark developed for large ANNs shows that the proposed code is on par with a C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark
- DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
- Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm
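To illustrate what an extensible layer interface can look like, here is a heavily simplified sketch. The trait below is a hypothetical stand-in: the real `Layer` and `LayerModel` traits are private, and their members are not described in this message, so none of these signatures should be read as the actual API.

```scala
object LayerSketch {
  // Hypothetical, simplified layer abstraction: a layer maps an input vector to an output vector.
  trait LayerModel {
    def eval(input: Array[Double]): Array[Double]
  }

  // Affine layer: output(j) = sum_i weights(j)(i) * input(i) + bias(j).
  final class AffineLayerModel(weights: Array[Array[Double]], bias: Array[Double])
      extends LayerModel {
    def eval(input: Array[Double]): Array[Double] =
      weights.zip(bias).map { case (row, b) =>
        row.zip(input).map { case (w, x) => w * x }.sum + b
      }
  }

  // Element-wise sigmoid activation layer.
  final class SigmoidLayerModel extends LayerModel {
    def eval(input: Array[Double]): Array[Double] =
      input.map(x => 1.0 / (1.0 + math.exp(-x)))
  }

  // Forward pass: feed the input through each layer in order.
  def forward(layers: Seq[LayerModel], input: Array[Double]): Array[Double] =
    layers.foldLeft(input)((acc, layer) => layer.eval(acc))
}
```

In this sketch a new layer type only needs to implement `eval`; the `forward` function does not change.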
mengxr and dbtsai kindly agreed to perform code review.
Author: Alexander Ulanov <nashb@yandex.ru>
Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>
Closes#7621 from avulanov/SPARK-2352-ann and squashes the following commits:
4806b6f [Alexander Ulanov] Addressing reviewers comments.
a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
f69bb3d [Alexander Ulanov] Addressing reviewers comments.
374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
35125ab [Alexander Ulanov] Style fix in tests
e191301 [Alexander Ulanov] Apache header
a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
support ml.NaiveBayes for Python
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7568 from yanboliang/spark-9214 and squashes the following commits:
5ee3fd6 [Yanbo Liang] fix typos
3ecd046 [Yanbo Liang] fix typos
f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
180452a [Yanbo Liang] fix typos
7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
Improve error message when number of examples is less than arity of high-arity categorical feature
CC jkbradley is this about what you had in mind? I know it's a starter, but was on my list to close out in the short term.
Author: Sean Owen <sowen@cloudera.com>
Closes#7800 from srowen/SPARK-9077 and squashes the following commits:
b8f6cdb [Sean Owen] Improve error message when number of examples is less than arity of high-arity categorical feature
Add checkpointing to GradientBoostedTrees, GBTClassifier, GBTRegressor
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7804 from jkbradley/gbt-checkpoint3 and squashes the following commits:
3fbd7ba [Joseph K. Bradley] tiny fix
b3e160c [Joseph K. Bradley] unset checkpoint dir after test
9cc3a04 [Joseph K. Bradley] added checkpointing to GBTs
Author: martinzapletal <zapletal-martin@email.cz>
Closes#7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:
8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b68efc0 [martinzapletal] Added tests for param validation.
07c12bd [martinzapletal] Comments and refactoring.
834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API
See https://issues.apache.org/jira/browse/SPARK-9479 for the failure cause.
The PR includes the following changes:
1. Make ReceiverTrackerSuite create StreamingContext in the test body.
2. Fix places that don't stop StreamingContext. I verified no SparkContext was stopped in the shutdown hook locally after this fix.
3. Fix an issue that `ReceiverTracker.endpoint` may be null.
4. Make sure stopping SparkContext in non-main thread won't fail other tests.
Author: zsxwing <zsxwing@gmail.com>
Closes#7797 from zsxwing/fix-ReceiverTrackerSuite and squashes the following commits:
3a4bb98 [zsxwing] Fix another potential NPE
d7497df [zsxwing] Fix ReceiverTrackerSuite; make sure StreamingContext in tests is closed
jkbradley Replaces the current hacky string comparison with vector comparisons.
Author: Feynman Liang <fliang@databricks.com>
Closes#7775 from feynmanliang/SPARK-9454-ldasuite-vector-compare and squashes the following commits:
bd91a82 [Feynman Liang] Remove println
905c76e [Feynman Liang] Fix string compare in distributed EM
2f24c13 [Feynman Liang] Improve LDASuite tests
Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
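The kind of constructor check described can be sketched with `require`, as below. `SimpleSparseVector` is a hypothetical simplified class for illustration, not Spark's `SparseVector`, and the error messages are made up.

```scala
object SparseVectorCheckSketch {
  // Hypothetical simplified sparse vector with constructor-time validation.
  final case class SimpleSparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
    require(indices.length == values.length,
      s"indices (${indices.length}) and values (${values.length}) must have the same length")
    require(indices.length <= size,
      s"vector of size $size cannot hold ${indices.length} indices/values")
  }
}
```

Constructing a size-1 vector with two indices then fails immediately with an `IllegalArgumentException` instead of producing an inconsistent object.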
CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.
Author: Sean Owen <sowen@cloudera.com>
Closes#7794 from srowen/SPARK-9277 and squashes the following commits:
e8dc31e [Sean Owen] Fix scalastyle
6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
Add unit tests for running LDA with empty documents.
Both EMLDAOptimizer and OnlineLDAOptimizer are tested.
feynmanliang
Author: Meihua Wu <meihuawu@umich.edu>
Closes#7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:
3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
f9432e8 [Meihua Wu] Incorporate reviewer's comments
8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents