## What changes were proposed in this pull request?
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with `sudo ./R/run-tests.sh`
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but SparkR does not currently support it. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
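For reference, the native R behaviour mentioned above can be sketched with base R's `glm`, which takes a `weights` vector. The data frame below is made up for illustration, and the commented `spark.glm` call only restates the `weightCol` argument described in this PR; its exact call shape is an assumption and it is not run here.

```r
# Toy data (hypothetical): each row carries a prior weight w.
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.2),
  w = c(1, 2, 1, 3, 1, 2)
)

# Native R: pass the weights vector through the `weights` argument.
weighted_fit <- glm(y ~ x, family = gaussian, data = df, weights = w)

# With integer weights, the coefficients match a fit on the data with
# each row repeated w times.
expanded <- df[rep(seq_len(nrow(df)), times = df$w), ]
expanded_fit <- glm(y ~ x, family = gaussian, data = expanded)

print(coef(weighted_fit))
print(coef(expanded_fit))

# SparkR counterpart as described in this PR (assumed call shape, not run here):
# model <- spark.glm(sparkDf, y ~ x, family = gaussian, weightCol = "w")
```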
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14177 from felixcheung/rsessiontest.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13801 from mengxr/SPARK-15177.1.
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
Notes on the SparkR implementation:
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported; the behavior is the same as in Python, where config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #13635 from felixcheung/rsparksession.
Eliminate the need to pass sqlContext to methods since it is a singleton - and we don't want to support multiple contexts in an R session.
Changes are done in a backward-compatible way with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily/cleanly (just delete a few lines per method).
Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
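The backward-compatible dispatch described above might look roughly like the following base R sketch. All names here (`create_data_frame`, `create_data_frame_default`, `legacy_context`) are hypothetical and do not reproduce the SparkR implementation; the sketch only shows one entry point accepting both the new call shape and the deprecated one that passes a context object first.

```r
# Hypothetical stand-in for a JVM reference; in SparkR such objects are 'jobj'.
legacy_context <- structure(list(id = "0"), class = "jobj")

# Core implementation with the new signature (no context argument).
create_data_frame_default <- function(data) {
  paste("data frame with", nrow(data), "rows")
}

# Public entry point: accepts both the new and the deprecated call shape.
create_data_frame <- function(x, ...) {
  if (inherits(x, "jobj")) {
    # Deprecated shape: create_data_frame(sqlContext, data)
    warning("passing a context is deprecated; use create_data_frame(data)")
    create_data_frame_default(...)
  } else {
    create_data_frame_default(x, ...)
  }
}

new_style <- create_data_frame(iris)
old_style <- suppressWarnings(create_data_frame(legacy_context, iris))
print(identical(new_style, old_style))
```

Removing the deprecated path later would only mean deleting the `inherits(x, "jobj")` branch, which matches the "delete a few lines per method" goal stated above.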
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9192 from felixcheung/rsqlcontext.
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## How was this patch tested?
SparkR unit test cases.
Author: Sun Rui <sunrui2016@gmail.com>
Closes #12867 from sun-rui/SPARK-15091.
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.
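As a small base R illustration of what an empty response variable means, a one-sided formula such as `~ x1 + x2` has no left-hand side, which suits unsupervised methods like k-means. The helper `has_response` is a hypothetical name used only for this sketch; it is not part of SparkR or RFormula.

```r
# A two-sided formula has a response; a one-sided formula does not.
two_sided <- y ~ x1 + x2
one_sided <- ~ x1 + x2

# Length 3 means (operator, response, terms); length 2 means (operator, terms).
has_response <- function(f) length(f) == 3

print(has_response(two_sided))
print(has_response(one_sided))

# The feature columns named on the right-hand side are still available.
print(all.vars(one_sided))
```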
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12813 from yanboliang/spark-15030.
## What changes were proposed in this pull request?
Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.
I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.
## How was this patch tested?
Existing unit tests.
cc: yanboliang thunterdb
Author: Xiangrui Meng <meng@databricks.com>
Closes #12807 from mengxr/SPARK-14831.
## What changes were proposed in this pull request?
This PR splits the MLlib algorithms into two flavors:
- the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
- the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages
In practice, the former calls the latter.
## How was this patch tested?
The tests for the various algorithms were adapted to be run against both interfaces.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #12789 from thunterdb/14831.
SparkR ```glm``` and ```kmeans``` model persistence.
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>
Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12685 from yanboliang/spark-14313.
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
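```
# Any writable location works; a temporary path is used here for illustration.
path <- tempfile(pattern = "naiveBayes")
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
# Persist the fitted model, then load it back from the same path.
ml.save(model, path)
model2 <- ml.load(path)
```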
## How was this patch tested?
Add unit tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12573 from yanboliang/spark-14312.
## What changes were proposed in this pull request?
The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792.
This PR just removes a workaround not needed anymore.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Closes #12606 from sun-rui/SPARK-13178.
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; this PR provides only the most commonly used statistics. More statistics can be added in follow-up work.
## How was this patch tested?
Unit tests.
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
Min 1Q Median 3Q Max
-0.95096 -0.16585 -0.00232 0.17410 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.23536 7.1231 4.4561e-11
Sepal_Length 0.34988 0.046301 7.5566 4.1873e-12
Species_versicolor -0.98339 0.072075 -13.644 0
Species_virginica -1.0075 0.093306 -10.798 0
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.22
Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
Min 1Q Median 3Q Max
-0.95096 -0.16522 0.00171 0.18416 0.72918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.67650 0.23536 7.123 4.46e-11 ***
Sepal.Length 0.34988 0.04630 7.557 4.19e-12 ***
Speciesversicolor -0.98339 0.07207 -13.644 < 2e-16 ***
Speciesvirginica -1.00751 0.09331 -10.798 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.217
Number of Fisher Scoring iterations: 2
```
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12393 from yanboliang/spark-13925.
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this one gets in.
* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
Unit tests.
cc mengxr jkbradley hhbyyh
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12294 from yanboliang/spark-12566.
## What changes were proposed in this pull request?
This PR continues the work in #11447 and implements the wrapper of ```AFTSurvivalRegression```, named ```survreg```, in SparkR.
## How was this patch tested?
Test against output from R package survival's survreg.
cc mengxr felixcheung
Close #11447
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11932 from yanboliang/spark-13010-new.
## What changes were proposed in this pull request?
This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.
I removed the preprocessing step that omits NA values because we don't know which columns to process.
## How was this patch tested?
Test against output from R package e1071's naiveBayes.
cc: yanboliang yinxusen
Closes #11486
Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #11890 from mengxr/SPARK-13449.
## What changes were proposed in this pull request?
This PR fixes all newly captured SparkR lint-r errors after the lintr package is updated from github.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes #11652 from sun-rui/SPARK-13812.
JIRA: https://issues.apache.org/jira/browse/SPARK-13472
## What changes were proposed in this pull request?
One Kmeans test in R is unstable and sometimes fails. We should fix it.
## How was this patch tested?
Unit test is modified in this PR.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #11345 from viirya/fix-kmeans-r-test and squashes the following commits:
f959f61 [Liang-Chi Hsieh] Sort resulted clusters.
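The reasoning behind sorting can be sketched in base R, assuming the instability came from cluster numbering being arbitrary between runs (the commit message does not spell this out). The label vectors below are made up for illustration and are not taken from the actual test.

```r
# Two runs that find the same clusters but number them differently.
run_a <- c(1, 1, 1, 2, 2, 3)
run_b <- c(3, 3, 3, 1, 1, 2)

# Cluster sizes in label order depend on the arbitrary numbering ...
sizes_a <- as.vector(table(run_a))
sizes_b <- as.vector(table(run_b))
print(identical(sizes_a, sizes_b))

# ... but sorted sizes are stable, so a test can compare them safely.
print(identical(sort(sizes_a), sort(sizes_b)))
```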
This PR:
1. Suppress all known warnings.
2. Cleanup test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases were not actually run previously because of a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.
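One common way to turn warnings into errors in R is `options(warn = 2)`. Whether this PR uses exactly that mechanism is an assumption; the sketch below only shows the effect on a call that would normally just warn.

```r
# Remember the current setting so it can be restored afterwards.
old <- options(warn = 2)

# With warn = 2, a warning is raised as an error.
result <- tryCatch(
  {
    as.integer("not a number")  # normally only warns about NAs
    "completed"
  },
  error = function(e) "failed"
)

# Restore the previous warning behaviour.
options(old)
print(result)
```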
Author: Sun Rui <rui.sun@intel.com>
Closes #10030 from sun-rui/SPARK-12034.
2015-12-07 10:38:17 -08:00
Renamed from R/pkg/inst/tests/test_mllib.R