## What changes were proposed in this pull request?
#14881 added a Kolmogorov-Smirnov test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented incorrectly and had no effect.
Running the following code for KSTest:
```r
data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
df <- createDataFrame(data)
testResult <- spark.kstest(df, "test", "norm")
summary(testResult)
```
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
The new implementation is similar to [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R.
BTW, I removed the comparison of ```print.summary.KSTest``` output in the unit test, since it is only a wrapper around the summary output, which is already checked. Another reason is that the comparison prints summary information to the test console, which clutters the test output.
## How was this patch tested?
Existing test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15139 from yanboliang/spark-17315.
## What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job with the submit option ```--files``` or with ```SparkContext.addFile()```. They can then get the added file with ```SparkFiles.get(filename)```.
We should also support this for SparkR users, since they also need to share dependency files. For example, SparkR users can first download third-party R packages to the driver, add these files to the Spark job as dependencies through this API, and then have each executor install the packages with ```install.packages```.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15131 from yanboliang/spark-17577.
## What changes were proposed in this pull request?
Clarify that slide and window duration are absolute, and not relative to a calendar.
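To illustrate what "absolute" means here, the following base R sketch computes the start of a fixed-width window by aligning to the Unix epoch plus an optional offset. The helper name `window_start` and the sample timestamp are assumptions made for illustration; this is not Spark's implementation.

```r
# Hypothetical helper: start of the fixed-width window containing `ts`,
# aligned to 1970-01-01 00:00:00 UTC plus an optional offset in seconds.
window_start <- function(ts, width_secs, offset_secs = 0) {
  t <- as.numeric(ts)
  start <- t - ((t - offset_secs) %% width_secs)
  as.POSIXct(start, origin = "1970-01-01", tz = "UTC")
}

ts <- as.POSIXct("2016-09-19 10:30:00", tz = "UTC")  # a Monday
week <- 7 * 24 * 3600

# A 7-day window is not a calendar week: it is anchored to the epoch,
# which fell on a Thursday.
start <- window_start(ts, week)
print(format(start, "%Y-%m-%d %H:%M:%S"))  # "2016-09-15 00:00:00", a Thursday
```

Under these assumptions, an offset of four days (`offset_secs = 4 * 86400`) would move the window boundaries to Mondays.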
## How was this patch tested?
Doc build (no functional change)
Author: Sean Owen <sowen@cloudera.com>
Closes #15142 from srowen/SPARK-17297.
## What changes were proposed in this pull request?
Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #15075 from srowen/SPARK-17445.
## What changes were proposed in this pull request?
This PR tries to add a SparkR vignette, which serves as a friendly guide through the functionality provided by SparkR.
## How was this patch tested?
Manual test.
Author: junyangq <qianjunyang@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Junyang Qian <junyangq@databricks.com>
Closes #14980 from junyangq/SPARKR-vignette.
## What changes were proposed in this pull request?
Fix summary() method's `return` description for spark.mlp
## How was this patch tested?
Ran tests locally on my laptop.
Author: Xin Ren <iamshrek@126.com>
Closes #15015 from keypointt/SPARK-16445-2.
## What changes were proposed in this pull request?
The ```reg``` argument of SparkR ```spark.als``` should default to 0.1, to be consistent with ML.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15021 from yanboliang/spark-17464.
## What changes were proposed in this pull request?
additional options were not passed down in write.df.
## How was this patch tested?
unit tests
falaki shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15010 from felixcheung/testreadoptions.
## What changes were proposed in this pull request?
Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.
cc shivaram
## How was this patch tested?
Unit tests
Author: Clark Fitzgerald <clarkfitzg@gmail.com>
Closes #14783 from clarkfitzg/SPARK-16785.
## What changes were proposed in this pull request?
This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution.
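For reference, the native R counterpart of a one-sample test against a normal distribution is `stats::ks.test`. The sketch below runs it on a small made-up sample and recomputes the statistic by hand; the sample values and the choice of N(0, 1) are assumptions for illustration, and nothing here calls SparkR.

```r
# Small illustrative sample; the null hypothesis is N(0, 1).
x <- c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5)

# Native R one-sample, two-sided test.
native <- ks.test(x, "pnorm", mean = 0, sd = 1)

# The same statistic by hand: the largest gap between the empirical CDF
# (just before and just after each sorted point) and the normal CDF.
n <- length(x)
cdf <- pnorm(sort(x), mean = 0, sd = 1)
i <- seq_len(n)
d <- max(pmax(i / n - cdf, cdf - (i - 1) / n))

print(c(by_hand = d, native = unname(native$statistic)))
print(native$p.value)
```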
## How was this patch tested?
R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14881 from junyangq/SPARK-17315.
## What changes were proposed in this pull request?
This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users.
## How was this patch tested?
Manual test.
![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)
Author: Junyang Qian <junyangq@databricks.com>
Closes #14942 from junyangq/fixSparkRSessionDoc.
## What changes were proposed in this pull request?
Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.
If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.
The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce an INNER join
that will require a subsequent join condition.
That is df1.join(df2) is equivalent to select * from df1, df2.
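The definition above can be sketched as a small predicate in base R. This is only an illustration of the rule as stated, not the check implemented in Spark's optimizer; the function name, the representation of the join condition as a list of column-name vectors, and the column names are all hypothetical.

```r
# Hypothetical check: `predicates` is a list with one character vector per
# conjunct of the join condition, naming the columns that conjunct references.
# The join is a cartesian product when no conjunct touches both sides.
is_cartesian_product <- function(predicates, left_cols, right_cols) {
  spans_both <- vapply(predicates, function(cols) {
    any(cols %in% left_cols) && any(cols %in% right_cols)
  }, logical(1))
  !any(spans_both)
}

left <- c("id", "name")
right <- c("dept_id", "dept")

# No condition at all, as in `select * from df1, df2`.
print(is_cartesian_product(list(), left, right))                    # TRUE
# Filters on each side only still pair every row with every row.
print(is_cartesian_product(list("name", "dept"), left, right))      # TRUE
# A condition such as id = dept_id relates the two sides.
print(is_cartesian_product(list(c("id", "dept_id")), left, right))  # FALSE
```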
## How was this patch tested?
Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.
Author: Srinath Shankar <srinath@databricks.com>
Closes #14866 from srinathshankar/crossjoin.
## What changes were proposed in this pull request?
change since version in doc
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14939 from felixcheung/rsparkversion2.
## What changes were proposed in this pull request?
Doc change - see https://issues.apache.org/jira/browse/SPARK-16324
## How was this patch tested?
manual check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14934 from felixcheung/regexpextractdoc.
## What changes were proposed in this pull request?
Add sparkR.version() API.
```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14935 from felixcheung/rsparksessionversion.
## What changes were proposed in this pull request?
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y:List of 5
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
The problem is that Spark returns the column type `decimal(10, 0)` instead of `decimal`, so `decimal(10, 0)` is not handled correctly. It should be handled as "double".
As discussed in JIRA thread, we can have two potential fixes:
1). Scala side fix: add a new case when writing the object back. However, I can't use spark.sql.types._ in Spark core due to dependency issues, and I haven't found a way to do the type case match;
2). SparkR side fix: add a helper function that checks for special types like `"decimal(10, 0)"` and replaces them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future.
I opened this PR to discuss the pros and cons of both approaches. If we want the Scala side fix, we need to find a way to match DecimalType and StructType in Spark Core.
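A minimal base R sketch of what the SparkR side helper in option 2 could look like. The function name and the regular expression are assumptions for illustration and are not taken from the patch.

```r
# Hypothetical helper: map a Spark SQL type string that needs special
# handling to an R primitive type name, or return NULL if it is not special.
special_type_to_primitive <- function(type) {
  # Matches parameterized decimals such as "decimal(10,0)" or "decimal(10, 0)".
  if (grepl("^decimal\\([0-9]+, *[0-9]+\\)$", type)) {
    return("double")
  }
  NULL
}

print(special_type_to_primitive("decimal(10, 0)"))  # "double"
print(special_type_to_primitive("string"))          # NULL
```

Further special cases could be added as extra branches in the same function, which is the sense in which such a helper would be generic.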
## How was this patch tested?
Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y: num 2 2 2 2 2
R Unit tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #14613 from wangmiao1981/type.
https://issues.apache.org/jira/browse/SPARK-17241
## What changes were proposed in this pull request?
Spark has a configurable L2 regularization parameter for generalized linear regression. It is very important to have it in SparkR so that users can run ridge regression.
## How was this patch tested?
Test manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes #14856 from keypointt/SPARK-17241.
## What changes were proposed in this pull request?
The usage in the original example is incorrect. This PR fixes it.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc.
## What changes were proposed in this pull request?
Remove cleanup.jobj test. Use JVM wrapper API for other test cases.
## How was this patch tested?
Run R unit tests with testthat 1.0
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14904 from shivaram/sparkr-jvm-tests-fix.
## What changes were proposed in this pull request?
Currently, `HiveContext` in SparkR is not being tested and is always skipped.
This is because the initialization of `TestHiveContext` fails when it tries to load non-existent data paths (test tables).
This was introduced in https://github.com/apache/spark/pull/14005
This PR enables the tests with SparkR.
## How was this patch tested?
Manually,
**Before** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
**After** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
Please refer to the tests below (on Windows)
- Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
- After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14889 from HyukjinKwon/SPARK-17326.
## What changes were proposed in this pull request?
This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
## How was this patch tested?
Unit tests, CRAN checks
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14775 from shivaram/sparkr-java-api.
## What changes were proposed in this pull request?
This PR tries to fix the name of the `SparkDataFrame` used in the example. It also gives a reference URL for an example data file so that users can play with it.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14853 from junyangq/SPARKR-FixLDADoc.
## What changes were proposed in this pull request?
The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14820 from junyangq/SPARK-FixNaiveBayes.
## What changes were proposed in this pull request?
This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package on their local machine.
As a clarification, for now, automatic installation will only happen if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In remote master mode, a local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we try to provide a detailed message that may help the users.
Some of the other messages have also been slightly changed.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14761 from junyangq/SPARK-16579-V1.
## What changes were proposed in this pull request?
This PR adds more examples to window function docs to make them more accessible to the users.
It also fixes default value issues for `lag` and `lead`.
## How was this patch tested?
Manual test, R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.
## What changes were proposed in this pull request?
Fixed several misplaced param tags - they should be on the spark.* method generics
## How was this patch tested?
run knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14792 from felixcheung/rdocmllib.
https://issues.apache.org/jira/browse/SPARK-16445
## What changes were proposed in this pull request?
Create Multilayer Perceptron Classifier wrapper in SparkR
## How was this patch tested?
Tested manually on local machine
Author: Xin Ren <iamshrek@126.com>
Closes #14447 from keypointt/SPARK-16445.
## What changes were proposed in this pull request?
The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14776 from junyangq/SPARK-FixShowDoc.
## What changes were proposed in this pull request?
The PR removes reference link in the doc for environment variables for common Windows folders. The cran check gave code 503: service unavailable on the original link.
## How was this patch tested?
Manual check.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14767 from junyangq/SPARKR-RemoveLink.
## What changes were proposed in this pull request?
Update DESCRIPTION
## How was this patch tested?
Run install and CRAN tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14764 from felixcheung/rpackagedescription.
## What changes were proposed in this pull request?
## How was this patch tested?
This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14759 from shivaram/sparkr-cran-jenkins.
## What changes were proposed in this pull request?
replace ``` ` ``` in code doc with `\code{thing}`
remove added `...` for drop(DataFrame)
fix remaining CRAN check warnings
## How was this patch tested?
create doc with knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14734 from felixcheung/rdoccleanup.
## What changes were proposed in this pull request?
This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.
## How was this patch tested?
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14758 from shivaram/sparkr-maintainers.
## What changes were proposed in this pull request?
refactor, cleanup, reformat, fix deprecation in test
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14735 from felixcheung/rmllibutil.
## What changes were proposed in this pull request?
This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.
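The difference between the two spellings can be seen with `Sys.getenv()` in base R. In the sketch below, the helper name and the sub-folder names are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: build a cache directory under the folder named by an
# environment variable. The sub-folder names are illustrative only.
cache_dir_from_env <- function(var_name) {
  base <- Sys.getenv(var_name, unset = NA)
  if (is.na(base) || !nzchar(base)) {
    stop("Environment variable ", var_name, " is not set")
  }
  file.path(base, "spark", "cache")
}

# "%LOCALAPPDATA%" is cmd.exe expansion syntax, not a variable name, so
# looking it up literally finds nothing.
print(Sys.getenv("%LOCALAPPDATA%", unset = NA))

# The bare name is what Sys.getenv() expects (normally set on Windows only).
if (!is.na(Sys.getenv("LOCALAPPDATA", unset = NA))) {
  print(cache_dir_from_env("LOCALAPPDATA"))
}
```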
## How was this patch tested?
Manual test in Windows 7.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14743 from junyangq/SPARKR-FixWindowsInstall.
## What changes were proposed in this pull request?
Ignore temp files generated by `check-cran.sh`.
Author: Xiangrui Meng <meng@databricks.com>
Closes #14740 from mengxr/R-gitignore.
## What changes were proposed in this pull request?
#14551 fixed an off-by-one bug in ```randomizeInPlace``` and some test failures caused by this fix.
But for the SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the native R output that the SparkR result is compared against; it did not change the R code in the comment that is used to reproduce the result in native R. This will confuse users who cannot reproduce the same result in native R. This PR sends a more robust test case which produces the same result in SparkR and native R.
## How was this patch tested?
Unit test update.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14730 from yanboliang/spark-16961-followup.
## What changes were proposed in this pull request?
This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.
The one remaining is the doc for R `stats::glm`, which is exported in SparkR. To mute that warning, we would also have to provide documentation for all arguments of that non-SparkR function.
Some previous conversation is in #14558.
## How was this patch tested?
R unit test and `check-cran.sh` script (with no-test).
Author: Junyang Qian <junyangq@databricks.com>
Closes #14705 from junyangq/SPARK-16508-master.
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961
Changed one line of Utils.randomizeInPlace to allow elements to stay in place.
Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.
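The idea can be sketched in base R: a Fisher-Yates shuffle in which the swap index may equal the current index, followed by a Pearson's chi-squared check of uniformity. This is an illustrative analogue, not the Scala code in `Utils.randomizeInPlace` or the test added by the patch; the function name, seed, and trial count are assumptions.

```r
# Fisher-Yates shuffle. R copies vectors on modification, so this returns a
# shuffled copy rather than mutating the argument.
fisher_yates <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  for (i in n:2) {
    # j is drawn from 1..i inclusive; j == i leaves x[i] where it is.
    # Drawing from 1..(i - 1) instead would be an off-by-one variant in
    # which an element can never stay in place at this step.
    j <- sample.int(i, 1)
    tmp <- x[i]
    x[i] <- x[j]
    x[j] <- tmp
  }
  x
}

set.seed(42)
trials <- 6000
# Record where element 1 ends up in each shuffle of 1:3.
landing <- replicate(trials, which(fisher_yates(1:3) == 1))
counts <- tabulate(landing, nbins = 3)

# Pearson's chi-squared goodness-of-fit test against a uniform distribution.
result <- chisq.test(counts)
print(counts)
print(result$p.value)
```

A very small p-value would suggest that the landing positions diverge from uniform.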
Author: Nick Lavers <nick.lavers@videoamp.com>
Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
## What changes were proposed in this pull request?
Add LDA Wrapper in SparkR with the following interfaces:
- spark.lda(data, ...)
- spark.posterior(object, newData, ...)
- spark.perplexity(object, ...)
- summary(object)
- write.ml(object)
- read.ml(path)
## How was this patch tested?
Test with SparkR unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #14229 from yinxusen/SPARK-16447.
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14392 from yanboliang/spark-16446.
## What changes were proposed in this pull request?
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14626 from felixcheung/rrddfunctions.
## What changes were proposed in this pull request?
Fix the issue that ```spark.glm``` ```weightCol``` should be in the signature.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14641 from yanboliang/weightCol.
## What changes were proposed in this pull request?
Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.
Updates:
Several changes have been made:
- `install.spark()`
- check existence of tar file in the cache folder, and download only if not found
- trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
- use 2.0.0
- `sparkR.session()`
- can install spark when not found in `SPARK_HOME`
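The mirror look-up priority described above can be sketched as a simple fallback chain in base R. The function name, its arguments, and the example URLs are hypothetical; the real `install.spark()` code may be organized differently.

```r
# Hypothetical fallback chain: user-provided URL first, then a preferred
# mirror obtained from a look-up function, then a hardcoded backup.
resolve_mirror_url <- function(user_url = NULL,
                               lookup_preferred = function() NULL,
                               backup_url = "https://mirror.example.org/spark") {
  if (!is.null(user_url)) {
    return(user_url)
  }
  # Treat a failed look-up (for example, no network) the same as no result.
  preferred <- tryCatch(lookup_preferred(), error = function(e) NULL)
  if (!is.null(preferred)) {
    return(preferred)
  }
  backup_url
}

print(resolve_mirror_url(user_url = "https://my.mirror.example/spark"))
print(resolve_mirror_url(lookup_preferred = function() "https://preferred.example/spark"))
print(resolve_mirror_url(lookup_preferred = function() stop("offline")))
```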
## How was this patch tested?
Manual tests, running the check-cran.sh script added in #14173.
Author: Junyang Qian <junyangq@databricks.com>
Closes #14258 from junyangq/SPARK-16579.
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
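For comparison, this is how weights behave in native R's `glm`. The data frame below is made up for illustration; with integer weights and the default gaussian family, the weighted fit gives the same coefficients as fitting on the data with each row repeated that many times.

```r
# Made-up data with an integer weight per row.
df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1.1, 1.9, 3.2, 3.8),
  w = c(1, 2, 1, 3)
)

# Weighted fit: `weights` is evaluated inside `data`, so `w` is the column.
weighted <- glm(y ~ x, data = df, weights = w)

# Equivalent unweighted fit on rows repeated according to their weight.
replicated <- glm(y ~ x, data = df[rep(seq_len(nrow(df)), df$w), ])

print(coef(weighted))
print(coef(replicated))
```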
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
## How was this patch tested?
SparkR unit tests, SparkSubmitSuite, check-cran.sh
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14243 from shivaram/sparkr-jar-move.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
When the `sparkPackages` argument is passed and we detect that we are in R script mode, we should print a warning such as: the --packages flag should be used with spark-submit.
## How was this patch tested?
In my system locally
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes #14179 from krishnakalyan3/spark-pkg.
## What changes were proposed in this pull request?
Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14177 from felixcheung/rsessiontest.
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14173 from shivaram/sparkr-cran-changes.
## What changes were proposed in this pull request?
More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.
## How was this patch tested?
unit tests
shivaram dongjoon-hyun
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14206 from felixcheung/rroutetests.
## What changes were proposed in this pull request?
Fix function routing to work with and without namespace operator `SparkR::createDataFrame`
## How was this patch tested?
manual, unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14195 from felixcheung/rroutedefault.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes #14192 from sun-rui/SPARK-16509.
## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init"
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14178 from felixcheung/rcsvprogrammingguide.
## What changes were proposed in this pull request?
Minor example updates
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14171 from felixcheung/rexample.
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* #13972 added labelling support to the ```include_example``` Jekyll plugin, so we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them under different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.
**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```
**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14096 from dongjoon-hyun/SPARK-16425.
## What changes were proposed in this pull request?
Use "NA" as the default null string for R, matching the `na.strings` parameter of R's read.csv.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"
A user passing a csv file with NA values should get the same behavior with SparkR read.df(... source = "csv")
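The native R behavior being matched can be seen with `read.csv` on a small inline file. The CSV content below is made up for illustration, and the sketch uses base R only, not `read.df`.

```r
csv_text <- "name,age\nalice,30\nbob,NA\nNA,41"

# Default-style parsing: the literal string "NA" is read as a missing value.
with_na <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)
print(is.na(with_na$age))
print(is.na(with_na$name))

# Treating only empty fields as missing keeps "NA" as an ordinary string.
literal <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)
print(literal$name)
```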
(couldn't open JIRA, will do that later)
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13984 from felixcheung/rcsvnastring.
## What changes were proposed in this pull request?
ORC test should be enabled only when HiveContext is available.
## How was this patch tested?
Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
DONE ===========================================================================
Tests passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14019 from dongjoon-hyun/SPARK-16233.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes #13975 from sun-rui/SPARK-16299.
## What changes were proposed in this pull request?
gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema needs to be provided.
This is similar to dapplyCollect().
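The group-apply-collect pattern can be mimicked locally in base R to show the shape of the user function, which receives the grouping key and the rows of that group. The helper `local_gapply_collect` is a hypothetical local analogue written for illustration; it is not SparkR code and does nothing distributed.

```r
# Hypothetical local analogue: split a data.frame by `cols`, apply
# `func(key, group)` to each group, and bind the results into one data.frame.
local_gapply_collect <- function(df, cols, func) {
  groups <- split(df, df[cols], drop = TRUE)
  results <- lapply(groups, function(g) {
    key <- as.list(g[1, cols, drop = FALSE])
    func(key, g)
  })
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Mean mpg per cylinder count; the result's columns come from the function's
# return value, so no schema is declared up front.
out <- local_gapply_collect(mtcars, "cyl", function(key, g) {
  data.frame(cyl = key$cyl, mean_mpg = mean(g$mpg))
})
print(out)
```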
## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect
Author: Narine Kokhlikyan <narine@slice.com>
Closes #13760 from NarineK/gapplyCollect.
## What changes were proposed in this pull request?
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For `array` argument, `after` is the same as `before`.
```scala
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13971 from dongjoon-hyun/SPARK-16289.
https://issues.apache.org/jira/browse/SPARK-16140
## What changes were proposed in this pull request?
Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.
## How was this patch tested?
Tested on my local machine
And on my laptop `jekyll build` is failing to build API docs, so here I can only show you the html I manually generated from Rd files, with no CSS applied, but the doc content should be there.
![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)
Author: Xin Ren <iamshrek@126.com>
Closes #13921 from keypointt/SPARK-16140.
## What changes were proposed in this pull request?
Add unit tests for csv data for SPARKR
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13904 from felixcheung/rcsv.
## What changes were proposed in this pull request?
update sparkR DataFrame.R comment
SQLContext ==> SparkSession
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #13946 from WeichenXu123/sparkR_comment_update_sparkSession.
## What changes were proposed in this pull request?
Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
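A rough base R sketch of one way such a per-cell truncation rule could work: cells longer than the limit are cut, with an ellipsis when there is room for one. The function name and the exact rule (including the threshold of 4) are assumptions for illustration and may differ from what `show` does.

```r
# Hypothetical truncation rule for a single cell.
truncate_cell <- function(s, truncate = 20L) {
  if (truncate > 0 && nchar(s) > truncate) {
    if (truncate < 4) {
      # Too short to fit an ellipsis: hard cut.
      substr(s, 1, truncate)
    } else {
      # Keep room for a three-character ellipsis.
      paste0(substr(s, 1, truncate - 3), "...")
    }
  } else {
    s
  }
}

print(truncate_cell("abcdefghij", 7))  # "abcd..."
print(truncate_cell("abcdefghij", 3))  # "abc"
print(truncate_cell("abcdefghij", 0))  # "abcdefghij" (no truncation)
```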
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
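For illustration, a sketch of the updated calls, assuming a local Spark installation. The group id and description are made-up values, and the arguments are passed positionally (group id, description, interrupt-on-cancel flag).

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "jobgroup-sketch")

# Tag subsequent jobs with a group id; no SparkContext argument is passed.
setJobGroup("nightly-report", "Hypothetical nightly report jobs", TRUE)
n <- count(createDataFrame(data.frame(x = 1:10)))

cancelJobGroup("nightly-report")  # cancel any jobs still running in the group
clearJobGroup()                   # stop tagging new jobs with the group

sparkR.session.stop()
```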
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13838 from felixcheung/rjobgroup.
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
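A rough sketch of the three functions the guide covers, assuming a local Spark installation with R available to the workers. The data, column names and functions are hypothetical, not taken from the guide itself.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "udf-sketch")

df <- createDataFrame(data.frame(x = c(1, 2, 3)))

# dapply: run an R function on each partition; the output schema must be declared.
schema <- structType(structField("x", "double"), structField("x2", "double"))
squared <- dapply(df, function(part) {
  data.frame(x = part$x, x2 = part$x^2)
}, schema)
squared_local <- collect(squared)

# dapplyCollect: same idea, but the result comes back as a local data.frame
# and no schema is needed.
doubled_local <- dapplyCollect(df, function(part) {
  data.frame(x = part$x, doubled = part$x * 2)
})

# spark.lapply: run an R function over the elements of a local list in parallel.
results <- spark.lapply(1:3, function(i) i * 2)

sparkR.session.stop()
```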
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes #13660 from vectorijk/spark-15672-R-guide-update.
## What changes were proposed in this pull request?
Add `union` and deprecate `unionAll`; separate the roxygen2 doc for `rbind` (since their usage and parameter lists are quite different).
`explode` is also deprecated, but its replacement seems to be a combination of calls; not sure if we should deprecate it in SparkR yet.
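A short sketch of the difference in usage, assuming a local Spark installation. The two tiny DataFrames are hypothetical.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "union-sketch")

a <- createDataFrame(data.frame(id = c(1, 2)))
b <- createDataFrame(data.frame(id = c(2, 3)))

# union takes exactly two SparkDataFrames and keeps duplicate rows.
combined <- union(a, b)
n_combined <- count(combined)

# rbind accepts any number of SparkDataFrames.
stacked <- rbind(a, b, a)
n_stacked <- count(stacked)

sparkR.session.stop()
```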
## How was this patch tested?
unit tests, manual checks for r doc
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13805 from felixcheung/runion.
## What changes were proposed in this pull request?
Found these issues while reviewing for SPARK-16090
## How was this patch tested?
roxygen2 doc gen, checked output html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13803 from felixcheung/rdocrd.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
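As a sketch of what calls look like with the renamed parameters, assuming a local Spark installation. The datasets are R's built-in `faithful` and `Titanic`, chosen here only for illustration, and the `tol` and `maxIter` values passed to `spark.glm` are arbitrary.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "mllib-params-sketch")

df <- createDataFrame(faithful)

# spark.glm: tol and maxIter replace epsilon and maxit.
glm_model <- spark.glm(df, eruptions ~ waiting, family = gaussian,
                       tol = 1e-6, maxIter = 25)
glm_coef <- summary(glm_model)$coefficients

# spark.kmeans: the new defaults written out explicitly.
km_model <- spark.kmeans(df, ~ eruptions + waiting,
                         k = 2, maxIter = 20, initMode = "k-means||")

# spark.naiveBayes: smoothing replaces laplace.
titanic <- as.data.frame(Titanic)
titanic_df <- createDataFrame(titanic[titanic$Freq > 0, -5])
nb_model <- spark.naiveBayes(titanic_df, Survived ~ Class + Sex + Age,
                             smoothing = 1.0)

sparkR.session.stop()
```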
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13801 from mengxr/SPARK-15177.1.
## What changes were proposed in this pull request?
I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc.
There are still more doc issues to be cleaned up.
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13798 from felixcheung/rdocseealso.
## What changes were proposed in this pull request?
This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.
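A minimal usage sketch, assuming a local Spark installation. The `year`, `course` and `earnings` columns are hypothetical sample data.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "pivot-sketch")

df <- createDataFrame(data.frame(
  year = c(2012, 2012, 2013, 2013),
  course = c("R", "Scala", "R", "Scala"),
  earnings = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
))

# Group by year, turn each distinct course into its own column,
# and sum earnings within each (year, course) cell.
wide <- collect(sum(pivot(groupBy(df, "year"), "course"), "earnings"))

sparkR.session.stop()
```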
## How was this patch tested?
Pass the Jenkins tests (including new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13786 from dongjoon-hyun/SPARK-15294.
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.
This pull request creates separate R docs for dapply and dapplyCollect; each page now links to the other instead of repeating its content.
## How was this patch tested?
Existing test cases.
Author: Narine Kokhlikyan <narine@slice.com>
Closes #13790 from NarineK/dapply-docs-fix.
## What changes were proposed in this pull request?
This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13734 from dongjoon-hyun/SPARK-14995.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
The following example illustrates SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```
## How was this patch tested?
Pass the Jenkins tests (including new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13768 from dongjoon-hyun/SPARK-16053.
## What changes were proposed in this pull request?
fix code doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13782 from felixcheung/rcountdoc.
## What changes were proposed in this pull request?
spark.lapply and setLogLevel
## How was this patch tested?
unit test
shivaram thunterdb
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13752 from felixcheung/rlapply.
## What changes were proposed in this pull request?
This issue adds `read.orc/write.orc` to SparkR for API parity.
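A round-trip sketch, assuming a local Spark installation whose build can read and write ORC (older releases need Hive support for this). The output path is a temporary location picked for illustration.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "orc-sketch")

df <- createDataFrame(data.frame(id = 1:5))

# Spark writes a directory of ORC part files at this path.
path <- tempfile(fileext = ".orc")
write.orc(df, path)

restored <- read.orc(path)
n_restored <- count(restored)

sparkR.session.stop()
```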
## How was this patch tested?
Pass the Jenkins tests (with new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13763 from dongjoon-hyun/SPARK-16051.
## What changes were proposed in this pull request?
Add dropTempView and deprecate dropTempTable
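A usage sketch, assuming a local Spark installation and a SparkR release that also provides `createOrReplaceTempView`. The view name `people` is hypothetical.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "tempview-sketch")

df <- createDataFrame(data.frame(name = c("Andy", "Justin"),
                                 stringsAsFactors = FALSE))

createOrReplaceTempView(df, "people")
n_people <- count(sql("SELECT name FROM people"))

# dropTempView replaces the deprecated dropTempTable.
dropTempView("people")
still_there <- "people" %in% unlist(tableNames())

sparkR.session.stop()
```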
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13753 from felixcheung/rdroptempview.
## What changes were proposed in this pull request?
This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following:
```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
  monotonically_increasing_id()    name age
1                             0 Michael  NA
2                             1    Andy  30
3                             2  Justin  19
```
## How was this patch tested?
Pass the Jenkins tests (with added testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13774 from dongjoon-hyun/SPARK-16059.
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
Notes on the SparkR implementation:
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both are now wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported; the behavior is the same as in Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
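To make the named-parameter syntax concrete, here is a sketch that assumes the entry point as it appears in released SparkR, `sparkR.session()` (this description calls it `sparkR.session.getOrCreate()`), and uses the released `sparkR.conf()` to read a value back. The config key and values are chosen only for illustration.

```r
library(SparkR)

# Named arguments stand in for the nested builder calls; extra named
# parameters passed through "..." are treated as Spark config entries.
sparkR.session(
  master = "local[1]",
  appName = "foo",
  enableHiveSupport = FALSE,
  spark.sql.shuffle.partitions = "4"
)

# Calling it again applies the new config to the existing session.
sparkR.session(spark.sql.shuffle.partitions = "8")
partitions <- sparkR.conf("spark.sql.shuffle.partitions")[[1]]

sparkR.session.stop()
```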
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #13635 from felixcheung/rsparksession.
## What changes were proposed in this pull request?
This PR adds `randomSplit` to SparkR for API parity.
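A small sketch of the call, assuming a local Spark installation. The weights and seed are arbitrary example values.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "randomsplit-sketch")

df <- createDataFrame(data.frame(id = 1:100))

# Weights are relative; roughly 70% of rows go to the first split.
splits <- randomSplit(df, c(7, 3), 42)
n_train <- count(splits[[1]])
n_test <- count(splits[[2]])

sparkR.session.stop()
```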
## How was this patch tested?
Pass the Jenkins tests (with new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13721 from dongjoon-hyun/SPARK-16005.
## What changes were proposed in this pull request?
Add `registerTempTable` to DataFrame with a deprecation warning.
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13722 from felixcheung/rregistertemptable.
## What changes were proposed in this pull request?
This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
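A sketch of the call forms, assuming a local Spark installation. The `key`, `value` and `extra` columns are hypothetical.

```r
library(SparkR)
sparkR.session(master = "local[1]", appName = "dropduplicates-sketch")

df <- createDataFrame(data.frame(
  key = c("a", "a", "b"),
  value = c(1, 1, 2),
  extra = c(10, 20, 30),
  stringsAsFactors = FALSE
))

# Varargs form: column names passed as separate arguments.
n_varargs <- count(dropDuplicates(df, "key", "value"))

# Character-vector form of the same call.
n_vector <- count(dropDuplicates(df, c("key", "value")))

# No columns given: rows are compared on all columns.
n_all <- count(dropDuplicates(df))

sparkR.session.stop()
```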
## How was this patch tested?
Pass the Jenkins tests with new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13684 from dongjoon-hyun/SPARK-15908.