## What changes were proposed in this pull request?
Support overriding the download URL (including the version directory) via an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.
## How was this patch tested?
Unit tests and manual testing:
- snapshot build url
  - download when spark jar not cached
  - when spark jar is cached
- RC build url
  - download when spark jar not cached
  - when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell
To use this,
```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```
then in R,
```
library(SparkR) # or specify lib.loc
sparkR.session()
```
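The override logic can be pictured with a minimal Python sketch. Only the variable name `SPARKR_RELEASE_DOWNLOAD_URL` comes from this PR; the function `resolve_download_url`, its `default_url` parameter, and the choice to treat a blank value as unset are assumptions made for illustration, not SparkR's actual implementation.

```python
import os

ENV_VAR = "SPARKR_RELEASE_DOWNLOAD_URL"  # name taken from the PR description


def resolve_download_url(default_url, environ=None):
    """Return the override URL when the variable is set, else default_url.

    Treating a blank value as "not set" is an assumption of this sketch.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR, "").strip()
    return override or default_url
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.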
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #16248 from felixcheung/rinstallurl.
## What changes were proposed in this pull request?
Several SparkR APIs that call into JVM methods with void return values print their result, especially when running in a REPL or IDE.
Example:
```
> setLogLevel("WARN")
NULL
```
We should fix this to make the result clearer.
Also found a small change to the return value of dropTempView in 2.1; adding doc and test for it.
## How was this patch tested?
manually - I didn't find an expect_*() method in testthat for this
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #16237 from felixcheung/rinvis.
## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:
returns summary information of the fitted model, which is a list. The list includes .......
Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not referenced here.
In the current documentation, some `return` entries end with `.` and some don't; `.` is added to the ones missing it.
Since `summary` for spark.logit is undergoing a big refactoring, this PR doesn't include it. It will be changed when the `spark.logit` PR is merged.
## How was this patch tested?
Manual build.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16150 from wangmiao1981/audit2.
## What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which has the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
### more details
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install the package and build vignettes again (this time from the packaged source) - this is a key step required to release an R package on CRAN
(will skip tests here but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1)
(the output of this step is what we package into Spark dist and sparkr.zip)
Alternatively,
`R CMD build` should already be installing the package in a temp directory, though it might just be a matter of finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead.
But in any case, despite installing the package multiple times this is relatively fast.
Building vignettes takes a while though.
## How was this patch tested?
Manually, CI.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #16014 from felixcheung/rdist.
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should be an argument of ```predict```. We will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16169 from yanboliang/spark-18326.
## What changes were proposed in this pull request?
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after l elements instead of k/l, which matters for small k.
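For reference, here is a minimal Python sketch of textbook reservoir sampling (Algorithm R) with the intended k/l replacement probability. It is not the Spark implementation; the function name and the optional `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for l, item in enumerate(items, start=1):  # l = number of items seen so far
        if l <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw j uniformly from [0, l); j < k happens with probability k/l.
            j = rng.randrange(l)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Drawing `j` from a range of size `l - 1` instead of `l` would reproduce the k/(l-1) bias described above.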
## How was this patch tested?
Existing test plus new test case.
Author: Sean Owen <sowen@cloudera.com>
Closes #16129 from srowen/SPARK-18678.
## What changes were proposed in this pull request?
Several cleanups and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will be changed in a later Spark version. Since that could introduce breaking changes, we do not expose them currently.
* SparkR test improvement: compare the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
## How was this patch tested?
Unit tests.
The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
versicolor virginica setosa
(Intercept) 1.514031 -2.609108 1.095077
Sepal_Length 0.02511006 0.2649821 -0.2900921
Sepal_Width -0.5291215 -0.02016446 0.549286
Petal_Length 0.03647411 0.1544119 -0.190886
Petal_Width 0.000236092 0.4195804 -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
Estimate
(Intercept) -6.053815
Sepal_Length 0.2449379
Sepal_Width 0.1648321
Petal_Length 0.4730718
Petal_Width 1.031947
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16117 from yanboliang/spark-18686.
## What changes were proposed in this pull request?
If SparkR is running as a package and has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. With this bug, the automatic Spark install only works in the first session.
This seems to be a regression from the earlier behavior.
The fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc).
## How was this patch tested?
Manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #16077 from felixcheung/rsessioninteractive.
## What changes were proposed in this pull request?
It would be better to fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, so they could output probabilities, log-space predictions, or original labels. To avoid a breaking API change for 2.1, revert this change for now; it will be added back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
## How was this patch tested?
Existing unit tests.
This reverts commit daa975f4bf.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16118 from yanboliang/spark-18291-revert.
## What changes were proposed in this pull request?
Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the indexed label.
In this PR, original label output is supported, and test cases are modified and added. The documentation is also updated.
## How was this patch tested?
Unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #15910 from wangmiao1981/audit.
## What changes were proposed in this pull request?
### The Issue
If I specify my schema when doing
```scala
spark.read
.schema(someSchemaWherePartitionColumnsAreStrings)
```
and partition inference infers the column as IntegerType (or, I assume, LongType or DoubleType; basically any fixed-size type), then the data will be corrupted once UnsafeRows are generated.
### Proposed solution
The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
The real issue is that a user who uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides and fall back to our inferred data types. What happens in the end is data corruption.
My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
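The type-resolution rule described above can be sketched in Python. Schemas are modelled as plain lists of `(name, type)` pairs, and both function names are hypothetical; the column ordering in `full_schema` (user columns kept in place, missing partition columns appended at the end) is an assumption of the sketch rather than something stated in this PR.

```python
def resolve_partition_columns(user_schema, inferred_partitions):
    """Prefer the user's type for each partition column, else the inferred one."""
    user_types = dict(user_schema)
    return [(name, user_types.get(name, inferred_type))
            for name, inferred_type in inferred_partitions]


def full_schema(user_schema, inferred_partitions):
    """Append any partition columns that the user did not list."""
    user_names = {name for name, _ in user_schema}
    resolved = resolve_partition_columns(user_schema, inferred_partitions)
    missing = [(name, typ) for name, typ in resolved if name not in user_names]
    return list(user_schema) + missing
```

With a user schema that declares `part` as a string and an inferred type of `int`, the resolved partition column stays a string, which is the case that previously led to corruption.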
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
## How was this patch tested?
Regression tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #15951 from brkyvz/partition-corruption.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes #15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since the ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw an exception if users request the ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15930 from yanboliang/spark-18501.
## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it downloads the Spark package from the Apache website, which is unnecessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is the output:
```
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
......
```
There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15888 from yanboliang/spark-18444.
## What changes were proposed in this pull request?
I found the documentation for the sample method confusing; this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Closes #15815 from anabranch/SPARK-18365.
## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for ```spark.mlp```.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15883 from yanboliang/spark-18438.
## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification), ```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15851 from yanboliang/spark-18412.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the original label rather than the indexed label. This issue is very similar to [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15842 from yanboliang/spark-18401.
## What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15746 from felixcheung/rgbt.
## What changes were proposed in this pull request?
minor doc update that should go to master & branch-2.1
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15747 from felixcheung/pySPARK-14393.
## What changes were proposed in this pull request?
In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them.
## How was this patch tested?
Existing unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #15697 from wangmiao1981/rtest.
## What changes were proposed in this pull request?
Due to a limitation of the Hive metastore (the table location must be a directory path, not a file path), we always store `path` for data source tables in storage properties instead of the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like the way we store the table schema of data source tables in table properties.
This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`: both data source tables and Hive serde tables should use the `locationUri` field.
This PR also unifies the way we handle the default table location for managed tables. Previously, the default table location of a Hive serde managed table was set by the external catalog, but that of a data source table was set by the command. After this PR, we follow the Hive way and the default table location is always set by the external catalog.
For managed non-file-based tables, we will assign a default table location and create an empty directory for it; the table location will be removed when the table is dropped. This is reasonable as the metastore doesn't care whether a table is file-based or not, and an empty table directory does no harm.
For external non-file-based tables, ideally we could omit the table location, but due to a Hive metastore issue, we will assign a random location to it and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #15024 from cloud-fan/path.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
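The rewrite in items 2 to 4 can be illustrated with a toy Python model. None of the names below are Catalyst APIs: `Expr`, `NAME_PLACEHOLDER`, `create_struct` and `resolve_placeholders` are hypothetical stand-ins, and the tuple `("named_struct", [...])` only mimics the alternating name/value argument list of a named struct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Expr:
    sql: str
    name: Optional[str] = None  # None models a not-yet-resolved named expression


NAME_PLACEHOLDER = object()  # hypothetical stand-in for a name placeholder


def create_struct(children):
    """Build a named-struct node: [name1, expr1, name2, expr2, ...]."""
    args = []
    for child in children:
        args.append(child.name if child.name is not None else NAME_PLACEHOLDER)
        args.append(child)
    return ("named_struct", args)


def resolve_placeholders(node, name_of):
    """Replace each placeholder with the name of the expression that follows it."""
    kind, args = node
    fixed = [name_of(args[i + 1]) if arg is NAME_PLACEHOLDER else arg
             for i, arg in enumerate(args)]
    return (kind, fixed)
```

Because every struct is built in the named form from the start, later cleanup passes in this toy model only ever see one node shape.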
## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, reran the entire analysis package successfully.
Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
Author: eyal farago <eyal farago>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: eyal farago <eyal.farago@gmail.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes #15718 from hvanhovell/SPARK-16839-2.
## What changes were proposed in this pull request?
This PR proposes to
- show R-friendly error messages rather than raw JVM exceptions.
As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same path with `write.df`, it seems safe to call `handledCallJMethod` to handle JVM messages.
- prevent the `zero-length variable name` error and print the ignored options as a warning message.
**Before**
``` r
> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
zero-length variable name
```
``` r
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
```
``` r
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
```
**After**
``` r
> read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
```
``` r
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...
> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...
> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...
> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
```
``` r
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;
> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;
> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;
> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
```
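The before/after messages above suggest a simple translation step: take the first line of the JVM message, map the exception class to a short label, and drop the stack trace. The Python sketch below illustrates that idea only; it is not the code behind `handledCallJMethod`, the function name `friendly_error` is hypothetical, and the label table lists just the two labels that appear in SparkR output quoted in this log.

```python
import re

# Exception class -> short label (labels as they appear in the quoted output).
_LABELS = {
    "AnalysisException": "analysis error",
    "IllegalArgumentException": "illegal argument",
}
_PATTERN = re.compile(r"([\w.$]+Exception): (.*)")


def friendly_error(jvm_message, operation):
    """Turn a raw JVM exception message into a one-line, R-style error."""
    lines = jvm_message.strip().splitlines() or [""]
    first_line = lines[0]  # everything after this is stack trace
    match = _PATTERN.search(first_line)
    if match:
        class_name = match.group(1).rsplit(".", 1)[-1]
        label = _LABELS.get(class_name)
        if label is not None:
            return f"Error in {operation} : {label} - {match.group(2)}"
    # Unknown exception types are passed through unchanged.
    return f"Error in {operation} : {first_line}"
```

Unrecognised exception classes fall through with their original first line, so no information is hidden in that case.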
## How was this patch tested?
Unit tests in `test_utils.R` and `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #15608 from HyukjinKwon/SPARK-17838.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, reran the entire analysis package successfully.
Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
Credit goes to hvanhovell for assisting with this PR.
Author: eyal farago <eyal farago>
Author: eyal farago <eyal.farago@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes #14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
## What changes were proposed in this pull request?
Random Forest Regression and Classification for R
Clean-up/reordering generics.R
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15607 from felixcheung/rrandomforest.
## What changes were proposed in this pull request?
This patch makes RBackend connection timeout configurable by user.
## How was this patch tested?
N/A
Author: Hossein <hossein@databricks.com>
Closes #15471 from falaki/SPARK-17919.
## What changes were proposed in this pull request?
API and programming guide doc changes for Scala, Python and R.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15629 from felixcheung/jsondoc.
## What changes were proposed in this pull request?
A couple of small, late-found fixes for doc.
## How was this patch tested?
manually
wangmiao1981
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15650 from felixcheung/logitfix.
## What changes were proposed in this pull request?
As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
## How was this patch tested?
New unit tests are added.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #15365 from wangmiao1981/glm.
## What changes were proposed in this pull request?
Add storageLevel to DataFrame for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780, but in R I do not make a class for `StorageLevel`; instead, I add a method `storageToString`.
## How was this patch tested?
test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #15516 from WeichenXu123/storageLevel_df_r.
## What changes were proposed in this pull request?
Update SparkR MLP, add initialWeights parameter.
## How was this patch tested?
test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.
## What changes were proposed in this pull request?
Fixes for R doc
## How was this patch tested?
N/A
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15589 from felixcheung/rdocmergefix.
(cherry picked from commit 0e0d83a597)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time.
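A minimal Python sketch of the deserialization rule follows. The "NA" string and NaN sentinels come from the description above; the rest is assumed for illustration: that non-NA dates arrive as ISO `yyyy-mm-dd` strings, that times arrive as seconds since the epoch, and the function names `read_date` and `read_time`, which are not the backend's actual methods.

```python
import math
from datetime import date, datetime, timezone


def read_date(value):
    """Map the "NA" sentinel to None; otherwise parse an ISO date string."""
    if value == "NA":
        return None
    return date.fromisoformat(value)


def read_time(seconds):
    """Map NaN to None; otherwise interpret the value as epoch seconds (UTC)."""
    if math.isnan(seconds):
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Checking the sentinel before parsing is what avoids the IllegalArgumentException-style failure on "NA" and the bogus timestamp on NaN.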
## How was this patch tested?
* [x] TODO
Author: Hossein <hossein@databricks.com>
Closes #15421 from falaki/SPARK-17811.
## What changes were proposed in this pull request?
Add crossJoin and do not default to cross join if joinExpr is left out
## How was this patch tested?
unit test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15559 from felixcheung/rcrossjoin.
## What changes were proposed in this pull request?
Fix for a bunch of test warnings that were added recently.
We need to investigate why warnings are not turning into errors.
```
Warnings -----------------------------------------------------------------------
1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length as column name
2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width as column name
3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length as column name
4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width as column name
Consider adding
importFrom("utils", "object.size")
to your NAMESPACE file.
```
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15560 from felixcheung/rwarnings.
## What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
I tested this on my MacBook. The following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```
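To make the file-based transfer concrete, here is a small Python sketch of a length-prefixed record file: each serialized chunk is written as a 4-byte big-endian length followed by its bytes. That framing is an assumption about the format the reader expects, not something stated in this PR, and `write_chunks` / `read_chunks` are hypothetical helper names.

```python
import struct


def write_chunks(path, chunks):
    """Write each bytes chunk as <4-byte big-endian length><payload>."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(struct.pack(">i", len(chunk)))
            f.write(chunk)


def read_chunks(path):
    """Read back the chunks written by write_chunks."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated length header")
            (length,) = struct.unpack(">i", header)
            payload = f.read(length)
            if len(payload) != length:
                raise ValueError("truncated payload")
            chunks.append(payload)
    return chunks
```

Going through a file sidesteps any single in-memory buffer having to hold the whole serialized object at once.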
## How was this patch tested?
* [x] Unit tests
Author: Hossein <hossein@databricks.com>
Closes #15375 from falaki/SPARK-17790.
## What changes were proposed in this pull request?
SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable Hive support, and the global temp view database in https://github.com/apache/spark/pull/14897.
Actually, we've already implemented static SQL conf implicitly via `SparkConf`; this PR just makes it explicit and exposes it to users, so that they can see the config value via SQL command or `SparkSession.conf`, and forbids users from setting/unsetting static SQL conf.
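The intended behaviour (static entries readable, but not settable or unsettable at runtime) can be sketched with a toy Python class. `SessionConf` and the key names used with it are hypothetical and unrelated to Spark's real `SQLConf` API.

```python
class SessionConf:
    """Toy config with immutable static entries and mutable session entries."""

    def __init__(self, static_values):
        self._static = dict(static_values)  # fixed at construction time
        self._session = {}

    def get(self, key):
        if key in self._static:
            return self._static[key]
        return self._session[key]

    def set(self, key, value):
        if key in self._static:
            raise ValueError(f"cannot modify static config: {key}")
        self._session[key] = value

    def unset(self, key):
        if key in self._static:
            raise ValueError(f"cannot unset static config: {key}")
        self._session.pop(key, None)
```

Reads go through the same `get` for both kinds of key, which mirrors the goal of letting users see static values without being able to change them.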
## How was this patch tested?
new tests in SQLConfSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes #15295 from cloud-fan/global-conf.
## What changes were proposed in this pull request?
Fix SparkR ```spark.naiveBayes``` error when response variable of dataset is numeric type.
See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15431 from yanboliang/spark-15153-2.
## What changes were proposed in this pull request?
This PR includes the changes below:
- Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs
- Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`
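Accepting logical, numeric and string options amounts to normalising every value to a string before it is handed to the data source. The Python sketch below shows one way to do that; the function name `options_to_strings`, the lowercase "true"/"false" rendering of booleans, and the rejection of other types are assumptions of the sketch, not a description of the SparkR code.

```python
def options_to_strings(**options):
    """Convert bool, numeric and string option values to strings."""
    converted = {}
    for name, value in options.items():
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python.
            converted[name] = "true" if value else "false"
        elif isinstance(value, (int, float, str)):
            converted[name] = str(value)
        else:
            raise TypeError(f"unsupported type for option '{name}': {type(value).__name__}")
    return converted
```

For example, `options_to_strings(header=True, mode="overwrite")` yields a dictionary whose values are all strings.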
## How was this patch tested?
Unit tests in `test_sparkSQL.R`/`utils.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #15239 from HyukjinKwon/SPARK-17665.
## What changes were proposed in this pull request?
The `write.df`/`read.df` APIs require a path, which is not always necessary in Spark. Currently, this only affects data sources implementing `CreatableRelationProvider`. Spark does not currently have internal data sources implementing this, but it would affect external data sources.
In addition, we'd be able to use this in Spark's JDBC data source after https://github.com/apache/spark/pull/12601 is merged.
**Before**
- `read.df`
```r
> read.df(source = "json")
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", :
argument "x" is missing with no default
```
```r
> read.df(path = c(1, 2))
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", :
argument "x" is missing with no default
```
```r
> read.df(c(1, 2))
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
at
...
In if (is.na(object)) { :
...
```
- `write.df`
```r
> write.df(df, source = "json")
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
```
```r
> write.df(df, source = c(1, 2))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```
```r
> write.df(df, mode = TRUE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```
**After**
- `read.df`
```r
> read.df(source = "json")
Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
```
```r
> read.df(path = c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```
```r
> read.df(c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```
- `write.df`
```r
> write.df(df, source = "json")
Error in save : illegal argument - 'path' is not specified
```
```r
> write.df(df, source = c(1, 2))
Error in .local(df, path, ...) :
source should be charactor, null or omitted. It is 'parquet' by default.
```
```r
> write.df(df, mode = TRUE)
Error in .local(df, path, ...) :
mode should be charactor or omitted. It is 'error' by default.
```
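The kind of up-front argument check that yields the clearer "After" messages can be sketched in base R as follows. `check_path` is a hypothetical name and the message wording is an assumption; this is not the code added by the PR.

```r
# Hypothetical validation helper (assumption, for illustration only):
# accept a character path, NULL, or an omitted argument; reject anything else.
check_path <- function(path = NULL) {
  if (!is.null(path) && !is.character(path)) {
    stop("path should be character, NULL or omitted.")
  }
  invisible(path)
}

check_path()                  # omitted: accepted
check_path("people.json")     # character: accepted
msg <- tryCatch(check_path(c(1, 2)), error = function(e) conditionMessage(e))
print(msg)
```

Failing early in R keeps the error readable, instead of surfacing a JVM-side `ClassCastException`.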
## How was this patch tested?
Unit tests in `test_sparkSQL.R`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15231 from HyukjinKwon/write-default-r.
## What changes were proposed in this pull request?
Some tests in `test_mllib.R` are as below:
```r
expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
```
The problem is, `is.na` is internally called via `na.omit` in `spark.mlp` which causes warnings as below:
```
Warnings -----------------------------------------------------------------------
1. spark.mlp (test_mllib.R#400) - is.na() applied to non-(list or vector) of type 'NULL'
2. spark.mlp (test_mllib.R#401) - is.na() applied to non-(list or vector) of type 'NULL'
```
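One way to avoid such warnings is to test for `NULL` before anything that calls `is.na`. The sketch below assumes this ordering; `validate_layers` is a hypothetical stand-in, not the actual `spark.mlp` code.

```r
# Hypothetical validator (assumption): the NULL and length checks run first,
# so NA handling is never applied to a NULL value.
validate_layers <- function(layers) {
  if (is.null(layers) || length(layers) <= 1 || anyNA(layers)) {
    stop("layers must be a integer vector with length > 1.")
  }
  as.integer(layers)
}

# Record whether a warning or an error is signalled for NULL input.
outcome <- tryCatch(
  validate_layers(NULL),
  warning = function(w) "warning",
  error = function(e) "error"
)
print(outcome)
print(validate_layers(c(4, 5, 3)))
```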
## How was this patch tested?
Manually tested. Also, Jenkins tests and AppVeyor.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15232 from HyukjinKwon/remove-warnnings.
## What changes were proposed in this pull request?
#15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` to Python/R, so we can update SparkR ```spark.addFile``` to support adding a directory recursively.
## How was this patch tested?
Added unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15216 from yanboliang/spark-17577-2.
## What changes were proposed in this pull request?
Spark adds sparkr.zip to the archives only in YARN mode (see SparkSubmit.scala):
```scala
if (args.isR && clusterManager == YARN) {
  val sparkRPackagePath = RUtils.localSparkRPackagePath
  if (sparkRPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
  if (!sparkRPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
  // Distribute the SparkR package.
  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
  // Distribute the R package archive containing all the built R packages.
  if (!RUtils.rPackages.isEmpty) {
    val rPackageFile =
      RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
    if (!rPackageFile.exists()) {
      printErrorAndExit("Failed to zip all the built R packages.")
    }
    val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
    // Assigns a symbol link name "rpkg" to the shipped package.
    args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
  }
}
```
So it is necessary to pass `spark.master` from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. In addition, this PR passes `spark.yarn.keytab`/`spark.yarn.principal` to the JVM side, because the JVM process needs them to access a secured cluster.
## How was this patch tested?
Verified manually in RStudio using the following code.
```r
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)
```
…
Author: Jeff Zhang <zjffdu@apache.org>
Closes#14784 from zjffdu/SPARK-17210.
## What changes were proposed in this pull request?
Update the `MultilayerPerceptronClassifierWrapper.fit` parameter types:
- `layers: Array[Int]`
- `seed: String`
Update several default params in SparkR `spark.mlp`:
- `tol` --> 1e-6
- `stepSize` --> 0.03
- `seed` --> NULL (when seed == NULL, the Scala-side wrapper regards it as a `null` value and the default seed is used)
The R-side `seed` only supports 32-bit integers.
Remove the default value of `layers`, and move it in front of the parameters that have default values.
Add a validation check for the `layers` parameter.
## How was this patch tested?
tests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15051 from WeichenXu123/update_py_mlp_default.
## What changes were proposed in this pull request?
#14881 added a Kolmogorov-Smirnov Test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented incorrectly and had no effect.
Running the following code for KSTest:
```r
data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
df <- createDataFrame(data)
testResult <- spark.kstest(df, "test", "norm")
summary(testResult)
```
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
The new implementation is similar to [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R.
BTW, I removed the comparison of ```print.summary.KSTest``` in the unit test, since it is only a wrapper around the summary output, which has already been checked. Another reason is that this comparison prints summary information to the test console, which clutters the test output.
## How was this patch tested?
Existing test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15139 from yanboliang/spark-17315.
## What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job with the submit option ```--files``` or ```SparkContext.addFile()```, and can then get an added file with ```SparkFiles.get(filename)```.
We should also support this for SparkR users, since they also need shared dependency files. For example, SparkR users can first download third-party R packages to the driver, add these files to the Spark job as dependencies through this API, and then have each executor install the packages with ```install.packages```.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15131 from yanboliang/spark-17577.
## What changes were proposed in this pull request?
Clarify that slide and window duration are absolute, and not relative to a calendar.
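To make "absolute" concrete, the sketch below computes the start of a tumbling window under the assumption that windows are laid out back-to-back from the Unix epoch (1970-01-01 00:00:00 UTC, a Thursday). `window_start` is a hypothetical helper written for this note, not a SparkR function.

```r
# Hypothetical helper (assumption): start of the fixed-length window that
# contains `ts`, with windows aligned to the Unix epoch rather than a calendar.
window_start <- function(ts, duration_secs) {
  secs <- as.numeric(ts)
  as.POSIXct(secs - (secs %% duration_secs), origin = "1970-01-01", tz = "UTC")
}

monday <- as.POSIXct("2016-09-19 12:00:00", tz = "UTC")
start <- window_start(monday, 7 * 24 * 3600)
print(start)
# wday counts days since Sunday, so 4 denotes Thursday
print(as.POSIXlt(start)$wday)
```

Under this assumption a 7-day window begins on a Thursday, not at the start of a calendar week.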
## How was this patch tested?
Doc build (no functional change)
Author: Sean Owen <sowen@cloudera.com>
Closes#15142 from srowen/SPARK-17297.
## What changes were proposed in this pull request?
Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15075 from srowen/SPARK-17445.
## What changes were proposed in this pull request?
This PR adds a SparkR vignette, which serves as a friendly guide to the functionality provided by SparkR.
## How was this patch tested?
Manual test.
Author: junyangq <qianjunyang@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Junyang Qian <junyangq@databricks.com>
Closes#14980 from junyangq/SPARKR-vignette.
## What changes were proposed in this pull request?
Fix summary() method's `return` description for spark.mlp
## How was this patch tested?
Ran tests locally on my laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#15015 from keypointt/SPARK-16445-2.
## What changes were proposed in this pull request?
The SparkR ```spark.als``` argument ```reg``` should be 0.1 by default, to be consistent with ML.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15021 from yanboliang/spark-17464.
## What changes were proposed in this pull request?
additional options were not passed down in write.df.
## How was this patch tested?
unit tests
falaki shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15010 from felixcheung/testreadoptions.
## What changes were proposed in this pull request?
Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.
cc shivaram
## How was this patch tested?
Unit tests
Author: Clark Fitzgerald <clarkfitzg@gmail.com>
Closes#14783 from clarkfitzg/SPARK-16785.
## What changes were proposed in this pull request?
This PR adds a Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one-sample tests against the normal distribution.
## How was this patch tested?
R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14881 from junyangq/SPARK-17315.
## What changes were proposed in this pull request?
This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users.
## How was this patch tested?
Manual test.
![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)
Author: Junyang Qian <junyangq@databricks.com>
Closes#14942 from junyangq/fixSparkRSessionDoc.
## What changes were proposed in this pull request?
Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.
If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.
The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce an INNER join
that will require a subsequent join condition.
That is, `df1.join(df2)` is equivalent to `select * from df1, df2`.
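For readers less familiar with the term, a cartesian product pairs every row of one relation with every row of the other. The base-R sketch below shows this on local data frames using `merge(..., by = NULL)`; it illustrates the concept only and does not use the SparkR `crossJoin` API.

```r
# Two small local relations with no columns in common.
r <- data.frame(id = 1:2)
s <- data.frame(label = c("a", "b", "c"), stringsAsFactors = FALSE)

# With by = NULL, merge() returns the cartesian product of its inputs.
cross <- merge(r, s, by = NULL)
print(cross)
print(nrow(cross) == nrow(r) * nrow(s))
```

The row count grows multiplicatively, which is why an accidental cartesian product on large relations is worth guarding against.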
## How was this patch tested?
Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.
Author: Srinath Shankar <srinath@databricks.com>
Closes#14866 from srinathshankar/crossjoin.
## What changes were proposed in this pull request?
change since version in doc
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14939 from felixcheung/rsparkversion2.
## What changes were proposed in this pull request?
Doc change - see https://issues.apache.org/jira/browse/SPARK-16324
## How was this patch tested?
manual check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14934 from felixcheung/regexpextractdoc.
## What changes were proposed in this pull request?
Add sparkR.version() API.
```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14935 from felixcheung/rsparksessionversion.
## What changes were proposed in this pull request?
```r
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
 $ x: num 1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
```
The problem is that spark returns `decimal(10, 0)` col type, instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double".
As discussed in JIRA thread, we can have two potential fixes:
1). Scala-side fix: add a new case when writing the object back. However, I can't use `spark.sql.types._` in Spark core due to dependency issues, and I haven't found a way to do the type case match.
2). SparkR-side fix: add a helper function to check for special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future.
I opened this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark core.
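The SparkR-side option can be pictured with a small base-R sketch. `special_type_to_primitive` is a hypothetical name chosen for this example, and the pattern is an assumption about what a decimal type string looks like; it is not the helper from the PR.

```r
# Hypothetical helper (assumption): map decimal type strings such as
# "decimal" or "decimal(10, 0)" to the primitive type name "double",
# leaving every other type string unchanged.
special_type_to_primitive <- function(type) {
  if (grepl("^decimal(\\([0-9]+, *[0-9]+\\))?$", type)) {
    "double"
  } else {
    type
  }
}

print(special_type_to_primitive("decimal(10, 0)"))
print(special_type_to_primitive("string"))
```

Further special cases could be added as extra branches, which is the sense in which such a helper is generic.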
## How was this patch tested?
Manual test:
```r
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
 $ x: num 1 1 1 1 1
 $ y: num 2 2 2 2 2
```
R Unit tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14613 from wangmiao1981/type.
https://issues.apache.org/jira/browse/SPARK-17241
## What changes were proposed in this pull request?
Spark has a configurable L2 regularization parameter for generalized linear regression. It is important to expose it in SparkR so that users can run ridge regression.
## How was this patch tested?
Test manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14856 from keypointt/SPARK-17241.
## What changes were proposed in this pull request?
The usage in the original example is incorrect. This PR fixes it.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14903 from junyangq/SPARKR-FixWindowPartitionByDoc.
## What changes were proposed in this pull request?
Remove cleanup.jobj test. Use JVM wrapper API for other test cases.
## How was this patch tested?
Run R unit tests with testthat 1.0
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14904 from shivaram/sparkr-jvm-tests-fix.
## What changes were proposed in this pull request?
Currently, `HiveContext` in SparkR is not being tested and is always skipped.
This is because the initialization of `TestHiveContext` fails when it tries to load non-existent data paths (test tables).
This was introduced in https://github.com/apache/spark/pull/14005
This PR enables the tests with SparkR.
## How was this patch tested?
Manually,
**Before** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
**After** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
Please refer to the tests below (on Windows)
- Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
- After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14889 from HyukjinKwon/SPARK-17326.
## What changes were proposed in this pull request?
This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
## How was this patch tested?
Unit tests, CRAN checks
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14775 from shivaram/sparkr-java-api.
## What changes were proposed in this pull request?
This PR fixes the name of the `SparkDataFrame` used in the example. It also gives a reference URL for an example data file so that users can play with it.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14853 from junyangq/SPARKR-FixLDADoc.
## What changes were proposed in this pull request?
The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14820 from junyangq/SPARK-FixNaiveBayes.
## What changes were proposed in this pull request?
This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package on their local machine.
As a clarification, for now, automatic installation will only happen if they start SparkR in an R console (rather than from sparkr-shell) and connect to a local master. In remote master mode, a local Spark package is still needed, but we will not trigger the `install.spark` function because the versions have to match those on the cluster, which involves more user input. Instead, we try to provide a detailed message that may help the users.
Some of the other messages have also been slightly changed.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14761 from junyangq/SPARK-16579-V1.
## What changes were proposed in this pull request?
This PR adds more examples to window function docs to make them more accessible to the users.
It also fixes default value issues for `lag` and `lead`.
## How was this patch tested?
Manual test, R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14779 from junyangq/SPARKR-FixWindowFunctionDocs.
## What changes were proposed in this pull request?
Fixed several misplaced param tags - they should be on the spark.* method generics
## How was this patch tested?
run knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14792 from felixcheung/rdocmllib.
https://issues.apache.org/jira/browse/SPARK-16445
## What changes were proposed in this pull request?
Create Multilayer Perceptron Classifier wrapper in SparkR
## How was this patch tested?
Tested manually on local machine
Author: Xin Ren <iamshrek@126.com>
Closes#14447 from keypointt/SPARK-16445.
## What changes were proposed in this pull request?
The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14776 from junyangq/SPARK-FixShowDoc.
## What changes were proposed in this pull request?
This PR removes the reference link in the doc for environment variables for common Windows folders. The CRAN check gave code 503 (service unavailable) on the original link.
## How was this patch tested?
Manual check.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14767 from junyangq/SPARKR-RemoveLink.
## What changes were proposed in this pull request?
Update DESCRIPTION
## How was this patch tested?
Run install and CRAN tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14764 from felixcheung/rpackagedescription.
## What changes were proposed in this pull request?
replace ``` ` ``` in code doc with `\code{thing}`
remove added `...` for drop(DataFrame)
fix remaining CRAN check warnings
## How was this patch tested?
create doc with knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14734 from felixcheung/rdoccleanup.
## What changes were proposed in this pull request?
This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.
## How was this patch tested?
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14758 from shivaram/sparkr-maintainers.
## What changes were proposed in this pull request?
refactor, cleanup, reformat, fix deprecation in test
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14735 from felixcheung/rmllibutil.
## What changes were proposed in this pull request?
This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.
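The difference is easy to see in base R: `Sys.getenv` takes the bare variable name, whereas `%...%` is command-shell expansion syntax on Windows. In the sketch below the value set for `LOCALAPPDATA` and the `spark-cache` sub-folder are illustrative assumptions only.

```r
# Illustrative value so the example also runs on non-Windows machines.
Sys.setenv(LOCALAPPDATA = "C:/Users/example/AppData/Local")

right <- Sys.getenv("LOCALAPPDATA")     # looks up the variable by its name
wrong <- Sys.getenv("%LOCALAPPDATA%")   # looks for a variable literally named "%LOCALAPPDATA%"

# Hypothetical cache location built from the correctly resolved folder.
cache_dir <- file.path(right, "spark-cache")
print(cache_dir)
print(nzchar(wrong))
```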
## How was this patch tested?
Manual test in Windows 7.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14743 from junyangq/SPARKR-FixWindowsInstall.
## What changes were proposed in this pull request?
#14551 fixed an off-by-one bug in ```randomizeInPlace``` and some test failures caused by this fix.
But for the SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the native R output that the SparkR result is compared against; it did not change the R code in the comment that is used to reproduce the result in native R. This will confuse users who cannot reproduce the same result in native R. This PR provides a more robust test case that produces the same result in SparkR and native R.
## How was this patch tested?
Unit test update.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14730 from yanboliang/spark-16961-followup.
## What changes were proposed in this pull request?
This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.
One left is doc for R `stats::glm` exported in SparkR. To mute that warning, we have to also provide document for all arguments of that non-SparkR function.
Some previous conversation is in #14558.
## How was this patch tested?
R unit test and `check-cran.sh` script (with no-test).
Author: Junyang Qian <junyangq@databricks.com>
Closes#14705 from junyangq/SPARK-16508-master.
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961
Changed one line of Utils.randomizeInPlace to allow elements to stay in place.
Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.
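For context, the textbook Fisher-Yates shuffle draws the swap index from a range that includes the current position, which is what lets an element stay where it is. The R sketch below shows that standard algorithm; it is not a translation of Spark's Scala `Utils.randomizeInPlace`.

```r
# Fisher-Yates shuffle (standard algorithm, shown for illustration).
fisher_yates_shuffle <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  for (i in n:2) {
    # sample.int(i, 1) is uniform on 1..i, so j == i (element stays put) is allowed.
    j <- sample.int(i, 1)
    tmp <- x[i]
    x[i] <- x[j]
    x[j] <- tmp
  }
  x
}

set.seed(42)
shuffled <- fisher_yates_shuffle(1:10)
print(shuffled)
```

If the index were drawn from 1..(i - 1) instead, no element could remain in its slot at its own step, and the output would not be uniform over all permutations.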
Author: Nick Lavers <nick.lavers@videoamp.com>
Closes#14551 from nicklavers/SPARK-16961-randomizeInPlace.
## What changes were proposed in this pull request?
Add LDA Wrapper in SparkR with the following interfaces:
- spark.lda(data, ...)
- spark.posterior(object, newData, ...)
- spark.perplexity(object, ...)
- summary(object)
- write.ml(object)
- read.ml(path)
## How was this patch tested?
Test with SparkR unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14229 from yinxusen/SPARK-16447.
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14392 from yanboliang/spark-16446.
## What changes were proposed in this pull request?
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14626 from felixcheung/rrddfunctions.
## What changes were proposed in this pull request?
Fix the issue that the ```spark.glm``` ```weightCol``` argument should be in the signature.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14641 from yanboliang/weightCol.
## What changes were proposed in this pull request?
Add an install_spark function to the SparkR package. Users can run `install_spark()` to install Spark to a local directory from within R.
Updates:
Several changes have been made:
- `install.spark()`
- check existence of tar file in the cache folder, and download only if not found
- trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
- use 2.0.0
- `sparkR.session()`
- can install spark when not found in `SPARK_HOME`
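The mirror look-up priority listed above amounts to taking the first available candidate in a fixed order. A minimal base-R sketch follows; `choose_mirror`, its argument names and the example URLs are all hypothetical and are not taken from `install.spark()`.

```r
# Hypothetical helper (assumption): return the first usable URL in priority
# order: user-provided, then preferred mirror, then hardcoded backup.
choose_mirror <- function(user_url = NULL, preferred_url = NULL,
                          backup_url = "http://backup.example.org/spark") {
  for (candidate in list(user_url, preferred_url, backup_url)) {
    if (!is.null(candidate) && nzchar(candidate)) {
      return(candidate)
    }
  }
  stop("no mirror URL available")
}

print(choose_mirror())
print(choose_mirror(preferred_url = "http://mirror.example.org/spark"))
print(choose_mirror(user_url = "http://user.example.org/spark",
                    preferred_url = "http://mirror.example.org/spark"))
```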
## How was this patch tested?
Manual tests, running the check-cran.sh script added in #14173.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14258 from junyangq/SPARK-16579.
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
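For reference, the native R side of this comparison looks like the sketch below, which uses only base R's `stats::glm` on the built-in `mtcars` data. The weight column `w` is made-up illustrative data, and the SparkR `weightCol` call itself is not shown because it needs a running Spark session.

```r
# Native R: fit a weighted Gaussian GLM via the `weights` argument.
cars <- mtcars
set.seed(1)
cars$w <- runif(nrow(cars))   # made-up observation weights, for illustration

fit <- glm(mpg ~ wt, data = cars, weights = w, family = gaussian())
print(coef(fit))
```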
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
## How was this patch tested?
SparkR unit tests, SparkSubmitSuite, check-cran.sh
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14243 from shivaram/sparkr-jar-move.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
If the `sparkPackages` argument is passed and we detect that we are in R script mode, we should print a warning such as: the `--packages` flag should be used with spark-submit.
## How was this patch tested?
Locally on my system.
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#14179 from krishnakalyan3/spark-pkg.
## What changes were proposed in this pull request?
Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14177 from felixcheung/rsessiontest.
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14173 from shivaram/sparkr-cran-changes.
## What changes were proposed in this pull request?
More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.
## How was this patch tested?
unit tests
shivaram dongjoon-hyun
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14206 from felixcheung/rroutetests.
## What changes were proposed in this pull request?
Fix function routing to work with and without namespace operator `SparkR::createDataFrame`
## How was this patch tested?
manual, unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14195 from felixcheung/rroutedefault.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#14192 from sun-rui/SPARK-16509.
## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init"
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14178 from felixcheung/rcsvprogrammingguide.
## What changes were proposed in this pull request?
Minor example updates
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14171 from felixcheung/rexample.
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.
**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```
**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14096 from dongjoon-hyun/SPARK-16425.
## What changes were proposed in this pull request?
Apply default "NA" as null string for R, like R read.csv na.string parameter.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"
A user passing a CSV file with NA values should get the same behavior with SparkR `read.df(... source = "csv")`.
(couldn't open JIRA, will do that later)
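The native R behaviour being matched can be checked with base R alone. In this small sketch the CSV content is made up for illustration, and `read.csv` is called with the same `na.strings = "NA"` setting referenced above.

```r
# A tiny in-memory CSV whose second data row has NA in the score column.
csv_text <- "name,score\nalice,1\nbob,NA\n"

df <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)
print(df)
print(is.na(df$score))
```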
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13984 from felixcheung/rcsvnastring.
## What changes were proposed in this pull request?
ORC test should be enabled only when HiveContext is available.
## How was this patch tested?
Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
DONE ===========================================================================
Tests passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14019 from dongjoon-hyun/SPARK-16233.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#13975 from sun-rui/SPARK-16299.
## What changes were proposed in this pull request?
gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.
This is similar to dapplyCollect().
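The group-apply-collect pattern can be sketched in plain Python as below. This is only a local, single-process model of the semantics; `gapply_collect`, the list-of-dicts row format, and `mean_x` are hypothetical names used for illustration.

```python
from collections import defaultdict

def gapply_collect(rows, key_col, func):
    """Group rows by key_col, apply func(key, group_rows) per group, return one local list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row)
    collected = []
    for key, group in groups.items():
        # func returns a list of result rows; no output schema is declared up front.
        collected.extend(func(key, group))
    return collected

def mean_x(key, group):
    return [{"dept": key, "mean_x": sum(r["x"] for r in group) / len(group)}]

rows = [{"dept": "a", "x": 1}, {"dept": "b", "x": 10}, {"dept": "a", "x": 3}]
out = gapply_collect(rows, "dept", mean_x)
print(out)
```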
## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect
Author: Narine Kokhlikyan <narine@slice.com>
Closes#13760 from NarineK/gapplyCollect.
## What changes were proposed in this pull request?
This PR implements the `posexplode` table-generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For an `array` argument, the behavior after this PR is the same as before.
```scala
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
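The row shapes shown above can be modeled with a short Python sketch: a map yields `(pos, key, value)` rows and an array yields `(pos, col)` rows. The Python function `posexplode` here is an illustrative stand-in, not Spark code.

```python
def posexplode(value):
    """Return one row per element, prefixed with its 0-based position."""
    if isinstance(value, dict):
        # Map argument: rows of (pos, key, value).
        return [(pos, key, val) for pos, (key, val) in enumerate(value.items())]
    # Array argument: rows of (pos, col).
    return [(pos, item) for pos, item in enumerate(value)]

print(posexplode({"a": 1, "b": 2}))  # [(0, 'a', 1), (1, 'b', 2)]
print(posexplode([1, 2, 3]))         # [(0, 1), (1, 2), (2, 3)]
```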
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
https://issues.apache.org/jira/browse/SPARK-16140
## What changes were proposed in this pull request?
Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.
## How was this patch tested?
Tested on my local machine
And on my laptop `jekyll build` is failing to build API docs, so here I can only show you the html I manually generated from Rd files, with no CSS applied, but the doc content should be there.
![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)
Author: Xin Ren <iamshrek@126.com>
Closes#13921 from keypointt/SPARK-16140.
## What changes were proposed in this pull request?
Add unit tests for CSV data for SparkR
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13904 from felixcheung/rcsv.
## What changes were proposed in this pull request?
update sparkR DataFrame.R comment
SQLContext ==> SparkSession
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13946 from WeichenXu123/sparkR_comment_update_sparkSession.
## What changes were proposed in this pull request?
Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
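A rough Python sketch of per-cell truncation to a caller-chosen width follows. The helper name `truncate_cell` and the exact ellipsis rule (no ellipsis for widths under 4, otherwise the last three kept characters become `...`) are assumptions for illustration and may not match what the patch does.

```python
def truncate_cell(text, truncate=20):
    """Shorten text to at most `truncate` characters; 0 or less disables truncation."""
    if truncate <= 0 or len(text) <= truncate:
        return text
    if truncate < 4:
        # Too narrow to fit an ellipsis, so just cut.
        return text[:truncate]
    return text[: truncate - 3] + "..."

print(truncate_cell("abcdefghij", 6))  # abc...
print(truncate_cell("abcdefghij", 0))  # abcdefghij
```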
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#13839 from ScrapCodes/add_truncateTo_DF.show.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13838 from felixcheung/rjobgroup.
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes#13660 from vectorijk/spark-15672-R-guide-update.
## What changes were proposed in this pull request?
add union and deprecate unionAll, separate roxygen2 doc for rbind (since their usage and parameter lists are quite different)
`explode` is also deprecated - but seems like replacement is a combination of calls; not sure if we should deprecate it in SparkR, yet.
## How was this patch tested?
unit tests, manual checks for r doc
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13805 from felixcheung/runion.
## What changes were proposed in this pull request?
Found these issues while reviewing for SPARK-16090
## How was this patch tested?
roxygen2 doc gen, checked output html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13803 from felixcheung/rdocrd.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13801 from mengxr/SPARK-15177.1.
## What changes were proposed in this pull request?
I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc.
There are still more doc issues to be cleaned up.
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13798 from felixcheung/rdocseealso.
## What changes were proposed in this pull request?
This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.
## How was this patch tested?
Pass the Jenkins tests (including new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13786 from dongjoon-hyun/SPARK-15294.
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.
In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link.
## How was this patch tested?
Existing test cases.
Author: Narine Kokhlikyan <narine@slice.com>
Closes#13790 from NarineK/dapply-docs-fix.
## What changes were proposed in this pull request?
This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13734 from dongjoon-hyun/SPARK-14995.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
id SPARK_PARTITION_ID()
1 3 0
2 4 0
3 8 1
4 9 1
5 0 2
6 1 3
7 2 4
8 5 5
9 6 6
10 7 7
```
## How was this patch tested?
Pass the Jenkins tests (including new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13768 from dongjoon-hyun/SPARK-16053.
## What changes were proposed in this pull request?
fix code doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13782 from felixcheung/rcountdoc.
## What changes were proposed in this pull request?
spark.lapply and setLogLevel
## How was this patch tested?
unit test
shivaram thunterdb
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13752 from felixcheung/rlapply.
## What changes were proposed in this pull request?
This issue adds `read.orc/write.orc` to SparkR for API parity.
## How was this patch tested?
Pass the Jenkins tests (with new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13763 from dongjoon-hyun/SPARK-16051.
## What changes were proposed in this pull request?
Add dropTempView and deprecate dropTempTable
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13753 from felixcheung/rdroptempview.
## What changes were proposed in this pull request?
This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following.
```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
monotonically_increasing_id() name age
1 0 Michael NA
2 1 Andy 30
3 2 Justin 19
```
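For intuition about the generated values, here is a Python sketch that assumes the bit layout described in Spark's API documentation for this function: the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper name and the `partition_sizes` input are hypothetical.

```python
def monotonically_increasing_ids(partition_sizes):
    """Generate IDs as (partition_id << 33) + record_number for each partition."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        base = partition_id << 33
        ids.extend(base + record for record in range(size))
    return ids

# A single partition with three rows gives 0, 1, 2, as in the output above.
print(monotonically_increasing_ids([3]))     # [0, 1, 2]
# With two partitions, the IDs are increasing and unique but not consecutive.
print(monotonically_increasing_ids([2, 1]))  # [0, 1, 8589934592]
```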
## How was this patch tested?
Pass the Jenkins tests (with added testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13774 from dongjoon-hyun/SPARK-16059.
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13635 from felixcheung/rsparksession.
## What changes were proposed in this pull request?
This PR adds `randomSplit` to SparkR for API parity.
## How was this patch tested?
Pass the Jenkins tests (with new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13721 from dongjoon-hyun/SPARK-16005.
## What changes were proposed in this pull request?
Add registerTempTable to DataFrame with a deprecation warning
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13722 from felixcheung/rregistertemptable.
## What changes were proposed in this pull request?
This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
## How was this patch tested?
Pass the Jenkins tests with new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13684 from dongjoon-hyun/SPARK-15908.
## What changes were proposed in this pull request?
R Docs changes
include typos, format, layout.
## How was this patch tested?
Test locally.
Author: Kai Jiang <jiangkai@gmail.com>
Closes#13394 from vectorijk/spark-15490.
## What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and if you have any ideas to improve it.
Thank you!
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12836 from NarineK/gapply2.
## What changes were proposed in this pull request?
Because of the fix in SPARK-15684, this exclusion is no longer necessary.
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13636 from felixcheung/rendswith.
## What changes were proposed in this pull request?
This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.
## How was this patch tested?
Existing SparkR tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13644 from liancheng/spark-15925-temp-view-for-r.
## What changes were proposed in this pull request?
When reviewing SPARK-15545, we found that is.nan is not exported, although it should be.
Add it to the NAMESPACE.
## How was this patch tested?
Manual tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13508 from wangmiao1981/unused.
## What changes were proposed in this pull request?
In R 3.3.0, startsWith and endsWith are added. In this PR, I make the two work in SparkR.
1. Remove signature in generic.R
2. Add setMethod in column.R
3. Add unit tests
## How was this patch tested?
Manually test it through SparkR shell for both column data and string data, which are added into the unit test file.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13476 from wangmiao1981/start.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
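The same candidate search can be sketched in Python with the regular expression from the `grep` call above. The `candidates` helper and the sample lines are made up for illustration; as with the shell command, hits still need manual review (for example, "an hour" is matched but is correct).

```python
import re

# Same pattern as the grep call: " an " followed by a non-vowel, case-insensitive.
PATTERN = re.compile(r" an [^aeiou]", re.IGNORECASE)

def candidates(lines):
    """Return (line_number, line) pairs that may need `an -> a`."""
    return [(number, line) for number, line in enumerate(lines, 1) if PATTERN.search(line)]

sample = [
    "Returns an column of the DataFrame",
    "This is an example",
    "Creates an DataFrame from a list",
    "Waits for an hour",
]
for number, line in candidates(sample):
    print(number, line)
```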
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
Change version check in R tests
## How was this patch tested?
R tests
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13369 from felixcheung/rversioncheck.
## What changes were proposed in this pull request?
Follow up on the earlier PR - in here we are fixing up roxygen2 doc examples.
Also add to the programming guide migration section.
## How was this patch tested?
SparkR tests
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13340 from felixcheung/sqlcontextdoc.
## What changes were proposed in this pull request?
This PR corrects SparkR to use `shell()` instead of `system2()` on Windows.
Using `system2(...)` on Windows does not handle the Windows file separator `\`. `shell(translate = TRUE, ...)` handles this problem, so the function is now chosen according to the OS.
Existing tests failed on Windows due to this problem. For example, these failed:
```
8. Failure: sparkJars tag in SparkContext (test_includeJAR.R#34)
9. Failure: sparkJars tag in SparkContext (test_includeJAR.R#36)
```
The cases above were due to the use of `system2`.
In addition, this PR also fixes some tests that failed on Windows.
```
5. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#128)
6. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#131)
7. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#134)
```
The cases above were due to inconsistent behaviour of `normalizePath()`. On Linux, if the path does not exist, it just returns the input, but on Windows it returns the input prefixed with the current path.
```r
# On Linux
path <- normalizePath("aa")
print(path)
[1] "aa"
# On Windows
path <- normalizePath("aa")
print(path)
[1] "C:\\Users\\aa"
```
## How was this patch tested?
Jenkins tests and manually tested on a Windows machine as below:
Here is the [stdout](https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab) of testing.
Closes#7025
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: Prakash PC <prakash.chinnu@gmail.com>
Closes#13165 from HyukjinKwon/pr/7025.
Eliminate the need to pass sqlContext to method since it is a singleton - and we don't want to support multiple contexts in a R session.
Changes are done in a back compat way with deprecation warning added. Method signature for S3 methods are added in a concise, clean approach such that in the next release the deprecated signature can be taken out easily/cleanly (just delete a few lines per method).
Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9192 from felixcheung/rsqlcontext.
## What changes were proposed in this pull request?
There are some failures when running SparkR unit tests.
In this PR, I fixed two of these failures in test_context.R and test_sparkSQL.R
The first one is due to different masked names. I added the missing names to the expected arrays.
The second one is because one PR removed the logic of a previous fix of missing subset method.
The file privilege issue is still there. I am debugging it. SparkR shell can run the test case successfully.
test_that("pipeRDD() on RDDs", {
actual <- collect(pipeRDD(rdd, "more"))
When using the run-tests script, it complains that there is no such directory, as below:
cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
## How was this patch tested?
Manually test it
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13284 from wangmiao1981/R.
## What changes were proposed in this pull request?
In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF. It is the starting index, counted from 1, so when 0 is passed, the result should always be 0.
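The Hive semantics described above can be written down as a short Python sketch. This is an illustrative model of the expected results, not the Spark expression code; the Python function `locate` is a stand-in with a 1-based `pos` argument.

```python
def locate(substr, s, pos=1):
    """Return the 1-based index of substr in s, searching from 1-based position pos.

    Returns 0 when substr is not found or when pos is less than 1.
    """
    if pos < 1:
        return 0
    # str.find is 0-based and returns -1 when not found, so +1 maps it to 1-based / 0.
    return s.find(substr, pos - 1) + 1

print(locate("aa", "aaa", 0))  # 0
print(locate("aa", "aaa", 1))  # 1
print(locate("aa", "aaa", 2))  # 2
```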
## How was this patch tested?
tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#13186 from adrian-wang/locate.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13184 from rxin/SPARK-14463.
## What changes were proposed in this pull request?
dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
```
dapplyCollect(df, function(ldf) {...})
```
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12989 from sun-rui/SPARK-15202.
## What changes were proposed in this pull request?
* Since Spark now has a native CSV reader, it is not necessary to use the third party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#13005 from yanboliang/r-df-examples.
## What changes were proposed in this pull request?
This PR is a workaround for NA handling in hash code computation.
This PR is on behalf of paulomagalhaes whose PR is https://github.com/apache/spark/pull/10436
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Author: ray <ray@rays-MacBook-Air.local>
Closes#12976 from sun-rui/SPARK-12479.
This PR:
1. Implement WindowSpec S4 class.
2. Implement Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects.
3. Implement over() of Column class.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#10094 from sun-rui/SPARK-11395.
## What changes were proposed in this pull request?
Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.
## How was this patch tested?
Unit tests
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12887 from NarineK/repartitionByColumns.
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## How was this patch tested?
SparkR unit test cases.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12867 from sun-rui/SPARK-15091.
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12813 from yanboliang/spark-15030.
## What changes were proposed in this pull request?
Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.
I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.
## How was this patch tested?
Existing unit tests.
cc: yanboliang thunterdb
Author: Xiangrui Meng <meng@databricks.com>
Closes#12807 from mengxr/SPARK-14831.
## What changes were proposed in this pull request?
This PR splits the MLlib algorithms into two flavors:
- the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
- the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages
In practice, the former calls the latter.
## How was this patch tested?
The tests for the various algorithms were adapted to be run against both interfaces.
Author: Timothy Hunter <timhunter@databricks.com>
Closes#12789 from thunterdb/14831.
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on local node
R function output: local data.frame
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
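The contract above can be modeled locally in Python. This is only a sketch of the semantics under stated assumptions: partitions are lists of dict rows, the schema is a tuple of column names, and `pickle` stands in for R serialization. The Python `dapply` below is a hypothetical stand-in, not the SparkR function.

```python
import pickle

def dapply(partitions, func, schema=None):
    """Apply func to each partition (a list of dict rows) and return the new partitions."""
    out_partitions = []
    for local_rows in partitions:
        result = func(local_rows)
        if schema is None:
            # No schema: keep each partition's result as one opaque byte array.
            out_partitions.append(pickle.dumps(result))
        else:
            # With a schema, every output row must match it exactly.
            for row in result:
                if tuple(row) != tuple(schema):
                    raise ValueError(f"row columns {tuple(row)} do not match schema {tuple(schema)}")
            out_partitions.append(result)
    return out_partitions

partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
doubled = dapply(partitions, lambda rows: [{"x": r["x"] * 2} for r in rows], schema=("x",))
raw = dapply(partitions, lambda rows: rows)
print(doubled)
```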
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12493 from sun-rui/SPARK-12919.
SparkR ```glm``` and ```kmeans``` model persistence.
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>
Closes#12778 from yanboliang/spark-14311.
Closes#12680
Closes#12683
## What changes were proposed in this pull request?
This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.
TODO:
- [x] check documentation
- [ ] check tests
Trivial example in SparkR:
```R
sparkLapply(1:5, function(x) { 2 * x })
```
Output:
```
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 6
[[4]]
[1] 8
[[5]]
[1] 10
```
Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
```R
library("MASS")
data(menarche)
families <- c("gaussian", "poisson")
train <- function(family){glm(Menarche ~ Age , family=family, data=menarche)}
results <- sparkLapply(families, train)
```
## How was this patch tested?
This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.
cc falaki davies
Author: Timothy Hunter <timhunter@databricks.com>
Closes#12426 from thunterdb/7264.
Make the behavior of mutate more consistent with that in dplyr, in addition to supporting replacement of existing columns.
1. Throw an error when there are duplicated column names in the DataFrame being mutated.
2. When there are duplicated column names among the columns specified by arguments, the last column of the same name takes effect.
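The two rules can be sketched in plain Python as follows. The data frame is modeled as a list of `(name, values)` pairs so that duplicated names are representable; this layout and the Python `mutate` helper are assumptions for illustration only.

```python
def mutate(columns, assignments):
    """columns and assignments are lists of (name, values) pairs."""
    names = [name for name, _ in columns]
    if len(set(names)) != len(names):
        # Rule 1: refuse to mutate a data frame that already has duplicated names.
        raise ValueError("duplicated column names in the data frame being mutated")
    result = dict(columns)
    for name, values in assignments:
        # Rule 2: a later assignment with the same name overwrites the earlier one.
        # An existing column with that name is replaced in place.
        result[name] = values
    return result

out = mutate([("a", [1, 2])], [("b", [0, 0]), ("b", [5, 6]), ("a", [9, 9])])
print(out)  # {'a': [9, 9], 'b': [5, 6]}
```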
Author: Sun Rui <rui.sun@intel.com>
Closes#10220 from sun-rui/SPARK-12235.
Added parameter drop to subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.
In R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
```
Currently, in SparkR:
```
> name <- "Sepal_Length"
> class(irisDF[, name])
[1] "DataFrame"
```
The previous code returns a DataFrame, which is inconsistent with R's behavior. SparkR should return a Column instead. Currently, for the user to get a Column given a column name as a character variable, they would have to go through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`. That itself is pretty hacky. `SparkR:::getColumn()` is another choice, but I don't see why this method should be externalized. Instead, following R's way of doing things, the proposed implementation allows this:
```
> name <- "Sepal_Length"
> class(irisDF[, name, drop=T])
[1] "Column"
> class(irisDF[, name, drop=F])
[1] "DataFrame"
```
This is consistent with R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
> class(iris[, name, drop=F])
[1] "data.frame"
```
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes#11318 from olarayej/SPARK-13436.
## What changes were proposed in this pull request?
Added method histogram() to compute the histogram of a Column
Usage:
```
## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)
## Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)
```
![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)
Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name
## How was this patch tested?
All unit tests pass. I added specific unit cases for different scenarios.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes#11569 from olarayej/SPARK-13734.
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12685 from yanboliang/spark-14313.
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```
## How was this patch tested?
Add unit tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12573 from yanboliang/spark-14312.
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12649 from dongjoon-hyun/SPARK-14883.
## What changes were proposed in this pull request?
Fixed inadvertent roxygen2 doc changes, added class name change to programming guide
Follow up of #12621
## How was this patch tested?
manually checked
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#12647 from felixcheung/rdataframe.
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
Author: Reynold Xin <rxin@databricks.com>
Closes#12634 from rxin/SPARK-14869.
## What changes were proposed in this pull request?
When the JVM backend fails without going through proper error handling (e.g. the process crashed), the R error message could be ambiguous.
```
Error in if (returnStatus != 0) { : argument is of length zero
```
This change attempts to make it more clear (however, one would still need to investigate why JVM fails)
## How was this patch tested?
manually
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #12622 from felixcheung/rreturnstatus.
## What changes were proposed in this pull request?
Changed the class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vectors, already defines "DataFrame"; this change avoids the conflict.
Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vectors does not define an "as.DataFrame").
Since one rarely references the type/class directly in R, this change should have minimal to no impact on SparkR users in terms of backward compatibility.
## How was this patch tested?
SparkR tests, manually loading S4Vector then SparkR package
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #12621 from felixcheung/rdataframe.
## What changes were proposed in this pull request?
The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792.
This PR just removes a workaround that is no longer needed.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Closes #12606 from sun-rui/SPARK-13178.
## What changes were proposed in this pull request?
This PR aims to add `setLogLevel` function to SparkR shell.
**Spark Shell**
```scala
scala> sc.setLogLevel("ERROR")
```
**PySpark**
```python
>>> sc.setLogLevel("ERROR")
```
**SparkR (this PR)**
```r
> setLogLevel(sc, "ERROR")
NULL
```
## How was this patch tested?
Pass the Jenkins tests including a new R testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12547 from dongjoon-hyun/SPARK-14780.
## What changes were proposed in this pull request?
This PR exposes the Scala `bround` function in the Python and R APIs.
The `bround` function was implemented in SPARK-14614 by extending the existing `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
  if (Double.isNaN(input) || Double.isInfinite(input)) {
    return input;
  }
  return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```
After this PR, `pyspark` and `sparkR` also support `bround` function.
**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```
**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```
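The half-even ("banker's") rounding quoted from Hive above can also be checked outside Spark. The following is a minimal sketch using Python's standard `decimal` module; the standalone `bround` function here is an illustration written for this note, not the PySpark implementation. It converts via `repr` so that the decimal value matches the float's shortest printed form, which is assumed to mirror `BigDecimal.valueOf(double)`.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN


def bround(value: float, scale: int = 0) -> float:
    """Round `value` to `scale` decimal places, ties going to the even digit."""
    # NaN and infinities pass through unchanged, as in the Hive snippet.
    if math.isnan(value) or math.isinf(value):
        return value
    # 10 ** -scale as a Decimal, e.g. Decimal('0.01') for scale = 2.
    quantum = Decimal(1).scaleb(-scale)
    # Note: very large magnitudes can exceed the default decimal context
    # precision and raise InvalidOperation; that case is out of scope here.
    return float(Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))
```

For the inputs used in the examples above, `bround(2.5)` gives `2.0` and `bround(3.5)` gives `4.0`: both ties resolve to the nearest even integer, unlike half-up rounding, which would give `3.0` and `4.0`.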
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12509 from dongjoon-hyun/SPARK-14639.
## What changes were proposed in this pull request?
Change the signature of as.data.frame() to be consistent with the one in the R base package, matching what R users expect.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes #11811 from sun-rui/SPARK-13905.
Add R API for `read.jdbc`, `write.jdbc`.
Tested this quite a bit manually with different combinations of parameters. It's not clear whether we could have automated tests in R for this: the Scala `JDBCSuite` depends on the Java H2 in-memory database.
Refactored some code into util functions so that it can be tested.
Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by the `jdbc` method of DataFrameReader/Writer. An alternative would be to add a `sql/r/SQLUtils` helper function, though that would take more code.
Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar
# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)
# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)
# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 2
# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1
# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10480 from felixcheung/rreadjdbc.
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; this PR provides only the most commonly used statistics. More statistics can be added in follow-up work.
## How was this patch tested?
Unit tests.
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918
Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.22
Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.08351462)
Null deviance: 28.307 on 149 degrees of freedom
Residual deviance: 12.193 on 146 degrees of freedom
AIC: 59.217
Number of Fisher Scoring iterations: 2
```
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12393 from yanboliang/spark-13925.
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The new API is modeled on R's glm, so it exposes only the arguments that R's glm supports: ```formula, family, data, epsilon and maxit```.
* This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this one gets in.
* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
Unit tests.
cc mengxr jkbradley hhbyyh
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12294 from yanboliang/spark-12566.
#### What changes were proposed in this pull request?
This PR addresses the comment at https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still catch the user error when users try to drop a table using `DROP VIEW`.
#### How was this patch tested?
Modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes #12284 from gatorsmile/followupDropTable.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.
With this PR, SQL, Java, and Scala share the same API; users can call:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python and R, users can access all APIs above, but in addition they can do
- In R:
`window(timeColumn, windowDuration, startTime=...)`
that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
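A rough sketch of how the three parameters could determine which windows a timestamp falls into. It assumes half-open `[start, end)` windows and uses plain integer seconds in place of Spark's duration strings; `windows_for` is a hypothetical helper written for this note, not part of any Spark API, and Spark's own implementation may differ in details.

```python
def windows_for(ts, window_duration, slide_duration=None, start_time=0):
    """Return the [start, end) windows that contain timestamp `ts`.

    All arguments are integers in the same unit (e.g. seconds).
    When `slide_duration` is omitted, windows tumble: slide == window.
    """
    if slide_duration is None:
        slide_duration = window_duration
    # Latest window start that is <= ts, aligned to the start_time offset.
    last_start = ts - ((ts - start_time) % slide_duration)
    windows = []
    start = last_start
    # Walk backwards while the window still covers ts (ts < start + duration).
    while start > ts - window_duration:
        windows.append((start, start + window_duration))
        start -= slide_duration
    return sorted(windows)
```

Under these assumptions, a tumbling call such as `windows_for(7, 10, start_time=3)` yields the single window `(3, 13)`, while a sliding call such as `windows_for(7, 10, 5)` yields two overlapping windows, `(0, 10)` and `(5, 15)`.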
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #12141 from brkyvz/R-windows.
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. This is only a code refactor of the original ```KMeans``` wrapper.
## How was this patch tested?
Existing tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12039 from yanboliang/spark-14059.
## What changes were proposed in this pull request?
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes #12024 from sun-rui/SPARK-12792_new.
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
Author: Sun Rui <rui.sun@intel.com>
Closes #10947 from sun-rui/SPARK-12792.
## What changes were proposed in this pull request?
This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes #11938 from andrewor14/session-catalog-again.