## What changes were proposed in this pull request?
Will need to port to this to branch-1.6, -2.0, -2.1, -2.2
## How was this patch tested?
manually
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19549 from felixcheung/rcranversioncheck.
This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`
## How was this patch tested?
Tested manually on a clean EC2 machine
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#19589 from shivaram/sparkr-tmpdir-clean.
## What changes were proposed in this pull request?
This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138cd0e.
Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion.
## How was this patch tested?
Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19551 from HyukjinKwon/SPARK-17902.
## What changes were proposed in this pull request?
Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
## How was this patch tested?
Added a new test case and fix existing test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#19438 from wzhfy/improve_percentile_approx.
## What changes were proposed in this pull request?
Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19436 from viirya/fix-flatmapinr-distribution.
## What changes were proposed in this pull request?
When zinc is running the pwd might be in the root of the project. A quick solution to this is to not go a level up incase we are in the root rather than root/core/. If we are in the root everything works fine, if we are in core add a script which goes and runs the level up
## How was this patch tested?
set -x in the SparkR install scripts.
Author: Holden Karau <holden@us.ibm.com>
Closes#19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
## What changes were proposed in this pull request?
Currently, we set lintr to jimhester/lintra769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).
I first tested and checked lintr-1.0.1 but it looks many important fixes are missing (for example, checking 100 length). So, I instead tried the latest commit, 5431140ffe, in my local and fixed the check failures.
It looks it has fixed many bugs and now finds many instances that I have observed and thought should be caught time to time, here I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).
The downside looks it now takes about 7ish mins, (it was 2ish mins before) in my local.
## How was this patch tested?
Manually, `./dev/lint-r` after manually updating the lintr package.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#19290 from HyukjinKwon/upgrade-r-lint.
## What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results.
But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.
After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
## How was this patch tested?
Added a new test and modified some existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19321 from wzhfy/approx_percentile_support_types.
## What changes were proposed in this pull request?
This PR make `sample(...)` able to omit `withReplacement` defaulting to `FALSE`.
In short, the following examples are allowed:
```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```
In addition, this PR also adds some type checking logics as below:
```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```
## How was this patch tested?
Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19243 from HyukjinKwon/SPARK-21780.
## What changes were proposed in this pull request?
Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example
## How was this patch tested?
Doc only change / existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#19276 from srowen/SPARK-22049.
## What changes were proposed in this pull request?
In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR.
### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice")']
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Closes#19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
## What changes were proposed in this pull request?
set.seed() before running tests
## How was this patch tested?
jenkins, appveyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19111 from felixcheung/rranseed.
## What changes were proposed in this pull request?
This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
**Python**
```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```
```
+----+----+----+
|col0|col1|col3|
+----+----+----+
| 1| 2| 3|
| 6| 4| 5|
+----+----+----+
```
**R**
```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```
```
carb am gear
1 4 1 4
2 4 1 4
3 4 1 4
4 4 1 4
```
## How was this patch tested?
Doctests for Python and unit test added in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19105 from HyukjinKwon/unionByName-r-python.
## What changes were proposed in this pull request?
fix the random seed to eliminate variability
## How was this patch tested?
jenkins, appveyor, lots more jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19018 from felixcheung/rrftest.
## What changes were proposed in this pull request?
Code in vignettes requires winutils on windows to run, when publishing to CRAN or building from source, winutils might not be available, so it's better to disable code run (so resulting vigenttes will not have output from code, but text is still there and code is still there)
fix * checking re-building of vignette outputs ... WARNING
and
> %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
## How was this patch tested?
jenkins, appveyor, r-hub
before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19016 from felixcheung/rvigwind.
## What changes were proposed in this pull request?
SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.
This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.
## How was this patch tested?
Modified and additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#18786 from aray/summary-r.
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18831 from actuaryzhang/sparkROffset.
## What changes were proposed in this pull request?
SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.
This is a followup PR for SPARK-20307.
## How was this patch tested?
New Unit tests are added.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.
## How was this patch tested?
Add test cases.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18613 from yanboliang/spark-20307.
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes#18625 from srowen/SPARK-21267.2.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.
**Python**
`from_json`
```python
from pyspark.sql.functions import from_json
data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```
**R**
`from_json`
```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```
`structType.character`
```R
structType("a STRING, b INT")
```
`dapply`
```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```
`gapply`
```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```
## How was this patch tested?
Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18498 from HyukjinKwon/SPARK-21266.
## What changes were proposed in this pull request?
This is a retry for #18320. This PR was reverted due to unexpected test failures with -10 error code.
I was unable to reproduce in MacOS, CentOS and Ubuntu but only in Jenkins. So, the tests proceeded to verify this and revert the past try here - https://github.com/apache/spark/pull/18456
This new approach was tested in https://github.com/apache/spark/pull/18463.
**Test results**:
- With the part of suspicious change in the past try (466325d3fd)
Tests ran 4 times and 2 times passed and 2 time failed.
- Without the part of suspicious change in the past try (466325d3fd)
Tests ran 5 times and they all passed.
- With this new approach (0a7589c09f)
Tests ran 5 times and they all passed.
It looks the cause is as below (see 466325d3fd):
```diff
+ exitCode <- 1
...
+ data <- parallel:::readChild(child)
+ if (is.raw(data)) {
+ if (unserialize(data) == exitCode) {
...
+ }
+ }
...
- parallel:::mcexit(0L)
+ parallel:::mcexit(0L, send = exitCode)
```
Two possibilities I think
- `parallel:::mcexit(.. , send = exitCode)`
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> It sends send to the master (unless NULL) and then shuts down the child process.
However, it looks possible that the parent attemps to terminate the child right after getting our custom exit code. So, the child gets terminated between "send" and "shuts down", failing to exit properly.
- A bug between `parallel:::mcexit(..., send = ...)` and `parallel:::readChild`.
**Proposal**:
To resolve this, I simply decided to avoid both possibilities with this new approach here (9ff89a7859). To support this idea, I explained with some quotation of the documentation as below:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> `readChild` and `readChildren` return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated or `NULL` if the child no longer exists (no children at all for `readChildren`).
`readChild` returns "an integer vector of length one with the process ID if a child terminated" so we can check if it is `integer` and the same selected "process ID". I believe this makes sure that the children are exited.
In case that children happen to send any data manually to parent (which is why we introduced the suspicious part of the change (466325d3fd)), this should be raw bytes and will be discarded (and then will try to read the next and check if it is `integer` in the next loop).
## How was this patch tested?
Manual tests and Jenkins tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18465 from HyukjinKwon/SPARK-21093-retry-1.
## What changes were proposed in this pull request?
This adds documentation to many functions in pyspark.sql.functions.py:
`upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
Add units to the trigonometry functions.
Renames columns in datetime examples to be more informative.
Adds links between some functions.
## How was this patch tested?
`./dev/lint-python`
`python python/pyspark/sql/functions.py`
`./python/run-tests.py --module pyspark-sql`
Author: Michael Patterson <map222@gmail.com>
Closes#17865 from map222/spark-20456.
## What changes were proposed in this pull request?
For randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch add a new method to set the underlying StringIndexer handleInvalid logic.
This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers.
## How was this patch tested?
Add a new unit test based on the error case in the JIRA.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18496 from wangmiao1981/handle.
## What changes were proposed in this pull request?
Add doc for methods that were left out, and fix various style and consistency issues.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18493 from actuaryzhang/sparkRDocCleanup.
## What changes were proposed in this pull request?
Grouped documentation for column window methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18481 from actuaryzhang/sparkRDocWindow.
## What changes were proposed in this pull request?
Grouped documentation for column collection methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18458 from actuaryzhang/sparkRDocCollection.
## What changes were proposed in this pull request?
Grouped documentation for column misc methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18448 from actuaryzhang/sparkRDocMisc.
## What changes were proposed in this pull request?
Grouped documentation for nonaggregate column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18422 from actuaryzhang/sparkRDocNonAgg.
## What changes were proposed in this pull request?
This PR proposes to support a DDL-formetted string as schema as below:
```r
mockLines <- c("{\"name\":\"Michael\"}",
"{\"name\":\"Andy\", \"age\":30}",
"{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```
## How was this patch tested?
Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18431 from HyukjinKwon/r-ddl-schema.
## What changes were proposed in this pull request?
Grouped documentation for string column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18366 from actuaryzhang/sparkRDocString.
## What changes were proposed in this pull request?
Grouped documentation for math column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18371 from actuaryzhang/sparkRDocMath.
## What changes were proposed in this pull request?
`mcfork` in R looks opening a pipe ahead but the existing logic does not properly close it when it is executed hot. This leads to the failure of more forking due to the limit for number of files open.
This hot execution looks particularly for `gapply`/`gapplyCollect`. For unknown reason, this happens more easily in CentOS and could be reproduced in Mac too.
All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
## How was this patch tested?
I ran the codes below on both CentOS and Mac with that configuration disabled/enabled.
```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
... # 30 times
```
Also, now it passes R tests on CentOS as below:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18320 from HyukjinKwon/SPARK-21093.
## What changes were proposed in this pull request?
Extend `setJobDescription` to SparkR API.
## How was this patch tested?
It looks difficult to add a test. Manually tested as below:
```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```
prints ...
![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18382 from HyukjinKwon/SPARK-21149.
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18114 from actuaryzhang/sparkRDocDate.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.
## How was this patch tested?
Add new unit tests.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18128 from wangmiao1981/test.
## What changes were proposed in this pull request?
Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18140 from actuaryzhang/sparkRFormula.
## What changes were proposed in this pull request?
LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs.
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18025 from actuaryzhang/sparkRDoc4.
## What changes were proposed in this pull request?
Add SQL trunc function
## How was this patch tested?
standard test
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18291 from actuaryzhang/sparkRTrunc2.
## What changes were proposed in this pull request?
This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying.
## How was this patch tested?
Manually running multiple times R tests via `./R/run-tests.sh`.
**Before**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................
Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
DONE ===========================================================================
```
**After**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18335 from HyukjinKwon/SPARK-21128.
## What changes were proposed in this pull request?
Update Running R Tests dependence packages to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18271 from wangyum/building-spark.
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18202 from gatorsmile/renameCVSOption.
## What changes were proposed in this pull request?
clean up after big test move
## How was this patch tested?
unit tests, jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18267 from felixcheung/rtestset2.
## What changes were proposed in this pull request?
Move all existing tests to non-installed directory so that it will never run by installing SparkR package
For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (if anyone has suggestion)
It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them)
## How was this patch tested?
- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18264 from felixcheung/rtestset.
## What changes were proposed in this pull request?
Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users.
## How was this patch tested?
N/A - doc only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#18256 from rxin/SPARK-21042.
## What changes were proposed in this pull request?
to investigate how long they run
## How was this patch tested?
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18104 from felixcheung/rtimetest.
## What changes were proposed in this pull request?
1, add an example for sparkr `decisionTree`
2, document it in user guide
## How was this patch tested?
local submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#18067 from zhengruifeng/dt_example.
## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18035 from yanboliang/svm-r.
## What changes were proposed in this pull request?
This change skips tests that use the Hadoop libraries while running
on CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are missing on the target
system. The skipped tests consist of
1. Tests that save, load a model in MLlib
2. Tests that save, load CSV, JSON and Parquet files in SQL
3. Hive tests
## How was this patch tested?
Tested by running on a local windows VM with HADOOP_HOME unset. Also testing with https://win-builder.r-project.org
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#17966 from shivaram/sparkr-windows-cran.
## What changes were proposed in this pull request?
support decision tree in R
## How was this patch tested?
added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17981 from zhengruifeng/dt_r.
## What changes were proposed in this pull request?
Some examples in the DataFrame methods are syntactically wrong, even though they are pseudo code. Fix these and some style issues.
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#18003 from actuaryzhang/sparkRDoc3.
## What changes were proposed in this pull request?
Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples.
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17988 from zero323/cube-docs.
## What changes were proposed in this pull request?
- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.
## How was this patch tested?
Unit tests, check `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17965 from zero323/SPARK-20726.
## What changes were proposed in this pull request?
This PR proposes three things as below:
- Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
- Support single argument for `to_timestamp` similarly with APIs in other languages.
For example, the one below works
```
import org.apache.spark.sql.functions._
Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
```
prints
```
+----------------------------------------+
|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
+----------------------------------------+
| 2016-12-31 00:12:00|
+----------------------------------------+
```
whereas this does not work in SQL.
**Before**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
```
**After**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
```
- Related document improvement for SQL function descriptions and other API descriptions accordingly.
**Before**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00.0
```
**After**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage:
to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
a date. Returns null with invalid input. By default, it follows casting rules to a date if
the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage:
to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
a timestamp. Returns null with invalid input. By default, it follows casting rules to
a timestamp if the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00
```
## How was this patch tested?
Added tests in `datetime.sql`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17901 from HyukjinKwon/to_timestamp_arg.
## What changes were proposed in this pull request?
- [x] need to test by running R CMD check --as-cran
- [x] sanity check vignettes
## How was this patch tested?
Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17945 from felixcheung/rchangesforpackage.
## What changes were proposed in this pull request?
Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs
## How was this patch tested?
unit tests, this needs to combine with another commit with SQL change to check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17905 from felixcheung/rtabletests.
## What changes were proposed in this pull request?
Cleaning existing temp tables before running tableNames tests
## How was this patch tested?
SparkR Unit tests
Author: Hossein <hossein@databricks.com>
Closes#17903 from falaki/SPARK-20661.
## What changes were proposed in this pull request?
Fix typo in vignettes
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#17884 from actuaryzhang/typo.
## What changes were proposed in this pull request?
set timezone on windows
## How was this patch tested?
unit test, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17892 from felixcheung/rtimestamptest.
## What changes were proposed in this pull request?
- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).
## How was this patch tested?
Unit tests, `check_cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17825 from zero323/SPARK-20550.
## What changes were proposed in this pull request?
add environment
## How was this patch tested?
wait for appveyor run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17878 from felixcheung/appveyorrcran.
## What changes were proposed in this pull request?
Make tests more reliable by having it till processed.
Increasing timeout value might help but ultimately the flakiness from processing delay when Jenkins is hard to account for. This isn't an actual public API supported
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17857 from felixcheung/rsstestrelia.
## What changes were proposed in this pull request?
Adds wrapper for `o.a.s.sql.functions.input_file_name`
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17818 from zero323/SPARK-20544.
## What changes were proposed in this pull request?
Adds support for generic hints on `SparkDataFrame`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17851 from zero323/SPARK-20585.
## What changes were proposed in this pull request?
Add
- R vignettes
- R programming guide
- SS programming guide
- R example
Also disable spark.als in vignettes for now since it's failing (SPARK-20402)
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17814 from felixcheung/rdocss.
## What changes were proposed in this pull request?
General rule on skip or not:
skip if
- RDD tests
- tests could run long or complicated (streaming, hivecontext)
- tests on error conditions
- tests won't likely change/break
## How was this patch tested?
unit tests, `R CMD check --as-cran`, `R CMD check`
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17817 from felixcheung/rskiptest.
## What changes were proposed in this pull request?
doc only
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17828 from felixcheung/rnotfamily.
## What changes were proposed in this pull request?
Adds R wrappers for:
- `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shading `base::grouping`
- `o.a.s.sql.functions.grouping_id`
## How was this patch tested?
Existing unit tests, additional unit tests. `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17807 from zero323/SPARK-20532.
## What changes were proposed in this pull request?
Add without param for timeout - will need this to submit a job that runs until stopped
Need this for 2.2
## How was this patch tested?
manually, unit test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17815 from felixcheung/rssawaitinfinite.
## What changes were proposed in this pull request?
- Add null-safe equality operator `%<=>%` (sames as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
- Add boolean negation operator `!` and function `not `.
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17783 from zero323/SPARK-20490.
## What changes were proposed in this pull request?
Ad R wrappers for
- `o.a.s.sql.functions.explode_outer`
- `o.a.s.sql.functions.posexplode_outer`
## How was this patch tested?
Additional unit tests, manual testing.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17809 from zero323/SPARK-20535.
## What changes were proposed in this pull request?
It seems we are using `SQLUtils.getSQLDataType` for type string in structField. It looks we can replace this with `CatalystSqlParser.parseDataType`.
They look similar DDL-like type definitions as below:
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
Such type strings looks identical when R’s one as below:
```R
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
struct
1 a
```
R’s one is stricter because we are checking the types via regular expressions in R side ahead.
Actual logics there look a bit different but as we check it ahead in R side, it looks replacing it would not introduce (I think) no behaviour changes. To make this sure, the tests dedicated for it were added in SPARK-20105. (It looks `structField` is the only place that calls this method).
## How was this patch tested?
Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17785 from HyukjinKwon/SPARK-20493.
## What changes were proposed in this pull request?
Replace
note repeat_string 2.3.0
with
note repeat_string since 2.3.0
## How was this patch tested?
`create-docs.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17779 from zero323/REPEAT-NOTE.
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17757 from yanboliang/flaky-test.
## What changes were proposed in this pull request?
- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.
## How was this patch tested?
- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17728 from zero323/SPARK-20437.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
Add wrappers for `o.a.s.sql.functions`:
- `split` as `split_string`
- `repeat` as `repeat_string`
## How was this patch tested?
Existing tests, additional unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17729 from zero323/SPARK-20438.
## What changes were proposed in this pull request?
Adds wrappers for `collect_list` and `collect_set`.
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17672 from zero323/SPARK-20371.
## What changes were proposed in this pull request?
Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17674 from zero323/SPARK-20375.
## What changes were proposed in this pull request?
Checking a source parameter is asynchronous. When the query is created, it's not guaranteed that source has been created. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17687 from zsxwing/SPARK-20397.
## What changes were proposed in this pull request?
Document fpGrowth in:
- vignettes
- programming guide
- code example
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17557 from zero323/SPARK-20208.
## What changes were proposed in this pull request?
This was suggested to be `as.json.array` at the first place in the PR to SPARK-19828 but we could not do this as the lint check emits an error for multiple dots in the variable names.
After SPARK-20278, now we are able to use `multiple.dots.in.names`. `asJsonArray` in `from_json` function is still able to be changed as 2.2 is not released yet.
So, this PR proposes to rename `asJsonArray` to `as.json.array`.
## How was this patch tested?
Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17653 from HyukjinKwon/SPARK-19828-followup.
## What changes were proposed in this pull request?
Currently, multi-dot separated variables in R is not allowed. For example,
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
produces an error as below:
```
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says
> The preferred form for variable names is all lower case letters and words separated with dots
This looks because lintr by default https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases seems not following Google's one as "a few tweaks".
Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43
Also, it looks we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md.
> `multiple_dots_linter`: check that function and variable names are separated by _ rather than ..
## How was this patch tested?
Manually tested `./dev/lint-r`with the manual change below in `R/functions.R`:
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
**Before**
```R
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
**After**
```
lintr checks passed.
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17590 from HyukjinKwon/disable-dot-in-name.
## What changes were proposed in this pull request?
Fixed spelling of "charactor"
## How was this patch tested?
Spelling change only
Author: Brendan Dwyer <brendan.dwyer@ibm.com>
Closes#17611 from bdwyer2/SPARK-20298.
## What changes were proposed in this pull request?
Test failed because SPARK_HOME is not set before Spark is installed.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17516 from felixcheung/rdircheckincran.
## What changes were proposed in this pull request?
Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17511 from felixcheung/rceatetable.
## What changes were proposed in this pull request?
Update doc to remove external for createTable, add refreshByPath in python
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17512 from felixcheung/catalogdoc.
## What changes were proposed in this pull request?
minor update
zero323
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17526 from felixcheung/rfpgrowthfollowup.
## What changes were proposed in this pull request?
It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields.
This PR proposes to fix `catalog.R`'s order so that running this script does not show up a small diff in this file every time.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17528 from HyukjinKwon/minor-reorder-description.
## What changes were proposed in this pull request?
Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
- `spark.fpGrowth` -model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r. FPGrowthWrapper`
- unit tests.
## How was this patch tested?
Feature specific unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17170 from zero323/SPARK-19825.
## What changes were proposed in this pull request?
Add a set of catalog API in R
```
"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",
```
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17483 from felixcheung/rcatalog.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
## What changes were proposed in this pull request?
If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed.
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#17452 from zuotingbing/spark-bulid.
## What changes were proposed in this pull request?
It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)) but resembles SQL type definition.
Therefore, it seems nicer if we test positive/negative cases in R side.
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17439 from HyukjinKwon/r-typestring-tests.
## What changes were proposed in this pull request?
This PR proposes to match minor documentations changes in https://github.com/apache/spark/pull/17399 and https://github.com/apache/spark/pull/17380 to R/Python.
## How was this patch tested?
Manual tests in Python , Python tests via `./python/run-tests.py --module=pyspark-sql` and lint-checks for Python/R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17429 from HyukjinKwon/minor-match-doc.
## What changes were proposed in this pull request?
SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
## How was this patch tested?
Add unit tests, and verify this fix at standalone and yarn cluster.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17274 from yanboliang/spark-19925.
## What changes were proposed in this pull request?
When SparkR is installed as a R package there might not be any java runtime.
If it is not there SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell, without any notification or message.
## How was this patch tested?
manually
- [x] need to test on Windows
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16596 from felixcheung/rcheckjava.
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.
## How was this patch tested?
existing tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17369 from zhengruifeng/doc_quantiles_nan.
## What changes were proposed in this pull request?
Currently JSON and CSV have exactly the same logic about handling bad records, this PR tries to abstract it and put it in a upper level to reduce code duplication.
The overall idea is, we make the JSON and CSV parser to throw a BadRecordException, then the upper level, FailureSafeParser, handles bad records according to the parse mode.
Behavior changes:
1. with PERMISSIVE mode, if the number of tokens doesn't match the schema, previously CSV parser will treat it as a legal record and parse as many tokens as possible. After this PR, we treat it as an illegal record, and put the raw record string in a special column, but we still parse as many tokens as possible.
2. all logging is removed as they are not very useful in practice.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>
Closes#17315 from cloud-fan/bad-record2.
## What changes were proposed in this pull request?
doc only change
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17356 from felixcheung/rdfcheckpoint2.
## What changes were proposed in this pull request?
Add checkpoint, setCheckpointDir API to R
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17351 from felixcheung/rdfcheckpoint.
## What changes were proposed in this pull request?
This PR proposes to support an array of struct type in `to_json` as below:
```scala
import org.apache.spark.sql.functions._
val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```
```
+----------+
| json|
+----------+
|[{"_1":1}]|
+----------+
```
Currently, it throws an exception as below (a newline manually inserted for readability):
```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
mismatch: structtojson requires that the expression is a struct expression.;;
```
This allows the roundtrip with `from_json` as below:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()
// Read back.
df.select(to_json($"array").as("json")).show()
```
```
+----------+
| array|
+----------+
|[[1], [2]]|
+----------+
+-----------------+
| json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```
Also, this PR proposes to rename from `StructToJson` to `StructsToJson ` and `JsonToStruct` to `JsonToStructs`.
## How was this patch tested?
Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17192 from HyukjinKwon/SPARK-19849.
## What changes were proposed in this pull request?
Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log
## How was this patch tested?
Manually, unit tests
With this, these are relocated to under /tmp
```
# ls /tmp/RtmpG2M0cB/
derby.log
```
And they are removed automatically when the R session is ended.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16330 from felixcheung/rderby.
## What changes were proposed in this pull request?
It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields.
This PR proposes to fix this so that running this script does not show up a diff in this file.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17349 from HyukjinKwon/minor-cran.
## What changes were proposed in this pull request?
Add "experimental" API for SS in R
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16982 from felixcheung/rss.
## What changes were proposed in this pull request?
Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField` as below:
```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
```
prints
```R
arrcol
1 Bob, Alice
```
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17178 from HyukjinKwon/SPARK-19828.
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR
felixcheung yanboliang
## How was this patch tested?
new test in SparkR
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16729 from actuaryzhang/sparkRTweedie.
## What changes were proposed in this pull request?
RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.
Below 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`
## How was this patch tested?
Test manually on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#17207 from keypointt/SPARK-19282.
### What changes were proposed in this pull request?
Observed by felixcheung in https://github.com/apache/spark/pull/16739, when users use the shuffle-enabled `repartition` API, they expect the partition they got should be the exact number they provided, even if they call shuffle-disabled `coalesce` later.
Currently, `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we got the following unexpected result.
```Scala
val df = spark.range(0, 10000, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```
This PR is to fix the issue. We preserve shuffle-enabled Repartition.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#16933 from gatorsmile/CollapseRepartition.
## What changes were proposed in this pull request?
Added checks for name consistency of input data frames in union.
## How was this patch tested?
new test.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17159 from actuaryzhang/sparkRUnion.
## What changes were proposed in this pull request?
Add column functions: to_json, from_json, and tests covering error cases.
## How was this patch tested?
unit tests, manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17134 from felixcheung/rtojson.
## What changes were proposed in this pull request?
Update R document to use JDK8.
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#17162 from wangyum/SPARK-19550.
## What changes were proposed in this pull request?
Update doc for R, programming guide. Clarify default behavior for all languages.
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17128 from felixcheung/jsonwholefiledoc.
Update R doc:
1. columns, names and colnames returns a vector of strings, not **list** as in current doc.
2. `colnames<-` does allow the subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`.
felixcheung
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17115 from actuaryzhang/sparkRMinorDoc.
## What changes were proposed in this pull request?
Replace `iris` dataset with `Titanic` or other dataset in example and document.
## How was this patch tested?
Manual and existing test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#17032 from wangmiao1981/example.
## What changes were proposed in this pull request?
The `[[` method is supposed to take a single index and return a column. This is different from base R which takes a vector index. We should check for this and issue warning or error when vector index is supplied (which is very likely given the behavior in base R).
Currently I'm issuing a warning message and just take the first element of the vector index. We could change this to an error it that's better.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17017 from actuaryzhang/sparkRSubsetter.
## What changes were proposed in this pull request?
This is a follow-up PR of #16800
When doing SPARK-19456, we found that "" should be consider a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.
## How was this patch tested?
Existing tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16945 from wangmiao1981/svc.
## What changes were proposed in this pull request?
We recently add the spark.svmLinear API for SparkR. We need to add an example and update the vignettes.
## How was this patch tested?
Manually run example.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16969 from wangmiao1981/example.
## What changes were proposed in this pull request?
SparkR ```approxQuantile``` supports input multiple columns.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16951 from yanboliang/spark-19619.
## What changes were proposed in this pull request?
Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16739 from felixcheung/rcoalesce.
## What changes were proposed in this pull request?
Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API.
Marked as WIP, as I am designing unit tests.
## How was this patch tested?
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16800 from wangmiao1981/svc.
## What changes were proposed in this pull request?
- this is cause by changes in SPARK-18444, SPARK-18643 that we no longer install Spark when `master = ""` (default), but also related to SPARK-18449 since the real `master` value is not known at the time the R code in `sparkR.session` is run. (`master` cannot default to "local" since it could be overridden by spark-submit commandline or spark config)
- as a result, while running SparkR as a package in IDE is working fine, CRAN check is not as it is launching it via non-interactive script
- fix is to add check to the beginning of each test and vignettes; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better.
## How was this patch tested?
Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar.
manually as:
```
# modify DESCRIPTION to revert version to 2.1.0
SPARK_HOME=/usr/spark R CMD build pkg
# run cran check without SPARK_HOME
R CMD check --as-cran SparkR_2.1.0.tar.gz
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16720 from felixcheung/rcranchecktest.
## What changes were proposed in this pull request?
Fix a bug in collect method for collecting timestamp column, the bug can be reproduced as shown in the following codes and outputs:
```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))
sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```
As we can see from the printed output, the column type of col2 in df2 is converted to numeric unexpectedly, when NA exists at the top of the column.
This is caused by method `do.call(c, list)`, if we convert a list, i.e. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01"))`, the class of the result is numeric instead of POSIXct.
Therefore, we need to cast the data type of the vector explicitly.
## How was this patch tested?
The patch can be tested manually with the same code above.
Author: titicaca <fangzhou.yang@hotmail.com>
Closes#16689 from titicaca/sparkr-dev.
## What changes were proposed in this pull request?
After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.
**BEFORE**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7']
Attempting to post to Github...
> Post successful.
```
**AFTER**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
```
## How was this patch tested?
Pass the existing test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16858 from dongjoon-hyun/hotfix_run-tests.
## What changes were proposed in this pull request?
This pull request adds two new user facing functions:
- `to_date` which accepts an expression and a format and returns a date.
- `to_timestamp` which accepts an expression and a format and returns a timestamp.
For example, Given a date in format: `2016-21-05`. (YYYY-dd-MM)
### Date Function
*Previously*
```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```
*Current*
```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```
### Timestamp Function
*Previously*
```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```
*Current*
```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```
### Tasks
- [X] Add `to_date` to Scala Functions
- [x] Add `to_date` to Python Functions
- [x] Add `to_date` to SQL Functions
- [X] Add `to_timestamp` to Scala Functions
- [x] Add `to_timestamp` to Python Functions
- [x] Add `to_timestamp` to SQL Functions
- [x] Add function to R
## How was this patch tested?
- [x] Add Functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` Expression (*not necessary*)
- Test new `ParseToDate` Expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>
Closes#16138 from anabranch/SPARK-16609.
## What changes were proposed in this pull request?
The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names.
## How was this patch tested?
new tests.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16794 from actuaryzhang/sparkRNames.
## What changes were proposed in this pull request?
Current version has error in vignettes:
```
model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
summary(kmeansModel)
```
`kmeansModel` does not exist...
felixcheung wangmiao1981
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16799 from actuaryzhang/sparkRVignettes.
## What changes were proposed in this pull request?
Update programming guide, example and vignette with Bisecting k-means.
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#16767 from krishnakalyan3/bisecting-kmeans.
## What changes were proposed in this pull request
When Kmeans using initMode = "random" and some random seed, it is possible the actual cluster size doesn't equal to the configured `k`.
In this case, summary(model) returns error due to the number of cols of coefficient matrix doesn't equal to k.
Example:
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
data length [9] is not a sub-multiple or multiple of the number of rows [2]
Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
## How was this patch tested?
Add unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16666 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
The `coefficients` component in model summary should be 'matrix' but the underlying structure is indeed list. This affects several models except for 'AFTSurvivalRegressionModel' which has the correct implementation. The fix is to first `unlist` the coefficients returned from the `callJMethod` before converting to matrix. An example illustrates the issues:
```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)
> str(s$coefficients)
List of 8
$ : num 6.53
$ : num -0.223
$ : num 0.479
$ : num 0.155
$ : num 13.6
$ : num -1.44
$ : num 0
$ : num 0.152
- attr(*, "dim")= int [1:2] 2 4
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "Sepal_Width"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963
$Sepal_Width
[1] 0.1550809
```
This shows that the underlying structure of coefficients is still `list`.
felixcheung wangmiao1981
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16730 from actuaryzhang/sparkRCoef.
## What changes were proposed in this pull request?
With extract `[[` or replace `[[<-`, the parameter `i` is a column index, that needs to be corrected in doc. Also a few minor updates: examples, links.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16721 from felixcheung/rsubsetdoc.
## What changes were proposed in this pull request?
This affects mostly running job from the driver in client mode when results are expected to be through stdout (which should be somewhat rare, but possible)
Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```
After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```
Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`) but there are other more complex examples with calls of `println` in Scala/JVM side, that are getting dropped.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16670 from felixcheung/rjvmstdout.
## What changes were proposed in this pull request?
add header
## How was this patch tested?
Manual run to check vignettes html is created properly
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16709 from felixcheung/rfilelicense.
## What changes were proposed in this pull request?
With doc to say this would convert DF into RDD
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16668 from felixcheung/rgetnumpartitions.
## What changes were proposed in this pull request?
Add R wrapper for bisecting Kmeans.
As JIRA is down, I will update title to link with corresponding JIRA later.
## How was this patch tested?
Add new unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16566 from wangmiao1981/bk.
## What changes were proposed in this pull request?
Support for
```
df[[myname]] <- 1
df[[2]] <- df$eruptions
```
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16663 from felixcheung/rcolset.
## What changes were proposed in this pull request?
```spark.gaussianMixture``` supports output total log-likelihood for the model like R ```mvnormalmixEM```.
## How was this patch tested?
R unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16646 from yanboliang/spark-19291.
## What changes were proposed in this pull request?
When R is starting as a package and it needs to download the Spark release distribution we need to handle error for download and untar, and clean up, otherwise it will get stuck.
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16589 from felixcheung/rtarreturncode.
## What changes were proposed in this pull request?
Refactored script to remove duplications and clearer purpose for each script
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16249 from felixcheung/rscripts.
## What changes were proposed in this pull request?
Windows seems to be the only place with appauthor in the path, for which we should say "Apache" (and case sensitive)
Current path of `AppData\Local\spark\spark\Cache` is a bit odd.
## How was this patch tested?
manual.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16590 from felixcheung/rcachedir.
## What changes were proposed in this pull request?
spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, which should be TRUE based on scala code.
In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
## How was this patch tested?
Manual tests by comparing with scala example.
Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16464 from wangmiao1981/new.
## What changes were proposed in this pull request?
To allow specifying number of partitions when the DataFrame is created
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16512 from felixcheung/rnumpart.
## What changes were proposed in this pull request?
spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.
Add missing parameters and corresponding document.
Modified existing unit tests to take additional parameters.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16523 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
```
df$foo <- 1
```
instead of
```
df$foo <- lit(1)
```
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16510 from felixcheung/rlitcol.
## What changes were proposed in this pull request?
R family is a longer list than what Spark supports.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16511 from felixcheung/rdocglmfamily.
## What changes were proposed in this pull request?
- [X] Make sure all join types are clearly mentioned
- [X] Make join labeling/style consistent
- [X] Make join label ordering docs the same
- [X] Improve join documentation according to above for Scala
- [X] Improve join documentation according to above for Python
- [X] Improve join documentation according to above for R
## How was this patch tested?
No tests b/c docs.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Closes#16504 from anabranch/SPARK-19126.
## What changes were proposed in this pull request?
- [X] Fix inconsistencies in function reference for dense rank and dense
- [X] Make all languages equivalent in their reference to `dense_rank` and `rank`.
## How was this patch tested?
N/A for docs.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Closes#16505 from anabranch/SPARK-19127.
## What changes were proposed in this pull request?
SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain:
* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R
Note: Only reorg, no actual code change.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16312 from yanboliang/spark-18862.
## What changes were proposed in this pull request?
#16126 bumps master branch version to 2.2.0-SNAPSHOT, but it seems R version was omitted.
## How was this patch tested?
N/A
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16488 from yanboliang/r-version.
## What changes were proposed in this pull request?
It would make it easier to integrate with other component expecting row-based JSON format.
This replaces the non-public toJSON RDD API.
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16368 from felixcheung/rJSON.
## What changes were proposed in this pull request?
API for SparkUI URL from SparkContext
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16367 from felixcheung/rwebui.
## What changes were proposed in this pull request?
SparkR tests, `R/run-tests.sh`, succeeds only once because `test_sparkSQL.R` does not clean up the test table, `people`.
As a result, the rows in `people` table are accumulated at every run and the test cases fail.
The following is the failure result for the second run.
```r
Failed -------------------------------------------------------------------------
1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
Lengths differ: 2 vs 1
2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
Lengths differ: 2 vs 1
```
## How was this patch tested?
Manual. Run `run-tests.sh` twice and check if it passes without failures.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16310 from dongjoon-hyun/SPARK-18897.
## What changes were proposed in this pull request?
doc cleanup
## How was this patch tested?
~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
Output html here: https://felixcheung.github.io/sparkr-vignettes.html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16286 from felixcheung/rvignettespass.
## What changes were proposed in this pull request?
When do the QA work, I found that the following issues:
1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). `spark.lda` document misses default values for some parameters.
I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
## How was this patch tested?
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16284 from wangmiao1981/ks.
## What changes were proposed in this pull request?
Added short section for KSTest.
Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849)
![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)
## How was this patch tested?
Manually tested example locally.
Built vignettes locally.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#16283 from jkbradley/ksTest-vignette.
## What changes were proposed in this pull request?
While adding vignettes for kstest, I found some errors in the example:
1. There is a typo of kstest;
2. print.summary.KStest doesn't work with the example;
Fix the example errors;
Add a new unit test for print.summary.KStest;
## How was this patch tested?
Manual test;
Add new unit test;
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16259 from wangmiao1981/ks.
## What changes were proposed in this pull request?
Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc.
cc: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#16264 from mengxr/SPARK-18793.
## What changes were proposed in this pull request?
Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
## How was this patch tested?
unit test, manually testing
- snapshot build url
- download when spark jar not cached
- when spark jar is cached
- RC build url
- download when spark jar not cached
- when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell
To use this,
```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```
then in R,
```
library(SparkR) # or specify lib.loc
sparkR.session()
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16248 from felixcheung/rinstallurl.
## What changes were proposed in this pull request?
Several SparkR API calling into JVM methods that have void return values are getting printed out, especially when running in a REPL or IDE.
example:
```
> setLogLevel("WARN")
NULL
```
We should fix this to make the result more clear.
Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it.
## How was this patch tested?
manually - I didn't find a expect_*() method in testthat for this
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16237 from felixcheung/rinvis.
## What changes were proposed in this pull request?
In this PR, the document of `summary` method is improved in the format:
returns summary information of the fitted model, which is a list. The list includes .......
Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here.
In current document, some `return` have `.` and some don't have. `.` is added to missed ones.
Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged.
## How was this patch tested?
Manual build.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16150 from wangmiao1981/audit2.
## What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not)
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
### more details
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh.
1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN
(will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests)
4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1)
(the output of this step is what we package into Spark dist and sparkr.zip)
Alternatively,
R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead.
But in any case, despite installing the package multiple times this is relatively fast.
Building vignettes takes a while though.
## How was this patch tested?
Manually, CI.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16014 from felixcheung/rdist.
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16169 from yanboliang/spark-18326.
## What changes were proposed in this pull request?
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.
## How was this patch tested?
Existing test plus new test case.
Author: Sean Owen <sowen@cloudera.com>
Closes#16129 from srowen/SPARK-18678.
## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users.
## How was this patch tested?
Unit tests.
The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
versicolor virginica setosa
(Intercept) 1.514031 -2.609108 1.095077
Sepal_Length 0.02511006 0.2649821 -0.2900921
Sepal_Width -0.5291215 -0.02016446 0.549286
Petal_Length 0.03647411 0.1544119 -0.190886
Petal_Width 0.000236092 0.4195804 -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
Estimate
(Intercept) -6.053815
Sepal_Length 0.2449379
Sepal_Width 0.1648321
Petal_Length 0.4730718
Petal_Width 1.031947
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16117 from yanboliang/spark-18686.
## What changes were proposed in this pull request?
If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session.
This seems to be a regression on the earlier behavior.
Fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc)
## How was this patch tested?
Manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16077 from felixcheung/rsessioninteractive.
## What changes were proposed in this pull request?
It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved.
## How was this patch tested?
Existing unit tests.
This reverts commit daa975f4bf.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16118 from yanboliang/spark-18291-revert.
## What changes were proposed in this pull request?
Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label.
In this PR, original label output is supported and test cases are modified and added. Document is also modified.
## How was this patch tested?
Unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15910 from wangmiao1981/audit.
## What changes were proposed in this pull request?
### The Issue
If I specify my schema when doing
```scala
spark.read
.schema(someSchemaWherePartitionColumnsAreStrings)
```
but if the partition inference can infer it as IntegerType or I assume LongType or DoubleType (basically fixed size types), then once UnsafeRows are generated, your data will be corrupted.
### Proposed solution
The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
## How was this patch tested?
Regression tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15951 from brkyvz/partition-corruption.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes#15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15930 from yanboliang/spark-18501.
## What changes were proposed in this pull request?
When running SparkR job in yarn-cluster mode, it will download Spark package from apache website which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
......
```
There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15888 from yanboliang/spark-18444.
## What changes were proposed in this pull request?
I found the documentation for the sample method to be confusing, this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Closes#15815 from anabranch/SPARK-18365.
## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for ```spark.mlp```.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15883 from yanboliang/spark-18438.
## What changes were proposed in this pull request?
* Fix the following exceptions which throws when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) were fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15851 from yanboliang/spark-18412.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output original label rather than the indexed label. This issue is very similar with [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15842 from yanboliang/spark-18401.
## What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15746 from felixcheung/rgbt.
## What changes were proposed in this pull request?
minor doc update that should go to master & branch-2.1
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15747 from felixcheung/pySPARK-14393.
## What changes were proposed in this pull request?
In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them.
## How was this patch tested?
Existing unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15697 from wangmiao1981/rtest.
## What changes were proposed in this pull request?
Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties.
This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.
This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.
For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15024 from cloud-fan/path.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
## How was this patch tested?
Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.
Modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.
Author: eyal farago <eyal farago>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: eyal farago <eyal.farago@gmail.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes#15718 from hvanhovell/SPARK-16839-2.
## What changes were proposed in this pull request?
This PR proposes to
- improve the R-friendly error messages rather than raw JVM exception one.
As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same path with `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` shares the same path with `write.df`, it seems it is safe to call `handledCallJMethod` to handle
JVM messages.
- prevent `zero-length variable name` and prints the ignored options as an warning message.
**Before**
``` r
> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
zero-length variable name
```
``` r
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
```
``` r
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
```
**After**
``` r
read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
```
``` r
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...
> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...
> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...
> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
```
``` r
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;
> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;
> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;
> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
```
## How was this patch tested?
Unit tests in `test_utils.R` and `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15608 from HyukjinKwon/SPARK-17838.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
## How was this patch tested?
running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.
modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.
Credit goes to hvanhovell for assisting with this PR.
Author: eyal farago <eyal farago>
Author: eyal farago <eyal.farago@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes#14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
## What changes were proposed in this pull request?
Random Forest Regression and Classification for R
Clean-up/reordering generics.R
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15607 from felixcheung/rrandomforest.
## What changes were proposed in this pull request?
This patch makes RBackend connection timeout configurable by user.
## How was this patch tested?
N/A
Author: Hossein <hossein@databricks.com>
Closes#15471 from falaki/SPARK-17919.
## What changes were proposed in this pull request?
API and programming guide doc changes for Scala, Python and R.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15629 from felixcheung/jsondoc.
## What changes were proposed in this pull request?
a couple of small late finding fixes for doc
## How was this patch tested?
manually
wangmiao1981
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15650 from felixcheung/logitfix.
## What changes were proposed in this pull request?
As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
## How was this patch tested?
New unit tests are added.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15365 from wangmiao1981/glm.
## What changes were proposed in this pull request?
Add storageLevel to DataFrame for SparkR.
This is similar to this RP: https://github.com/apache/spark/pull/13780
but in R I do not make a class for `StorageLevel`
but add a method `storageToString`
## How was this patch tested?
test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15516 from WeichenXu123/storageLevel_df_r.
## What changes were proposed in this pull request?
update SparkR MLP, add initalWeights parameter.
## How was this patch tested?
test added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15552 from WeichenXu123/mlp_r_add_initialWeight_param.
## What changes were proposed in this pull request?
Fixes for R doc
## How was this patch tested?
N/A
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15589 from felixcheung/rdocmergefix.
(cherry picked from commit 0e0d83a597)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time.
## How was this patch tested?
* [x] TODO
Author: Hossein <hossein@databricks.com>
Closes#15421 from falaki/SPARK-17811.
## What changes were proposed in this pull request?
Add crossJoin and do not default to cross join if joinExpr is left out
## How was this patch tested?
unit test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15559 from felixcheung/rcrossjoin.
## What changes were proposed in this pull request?
testthat library we are using for testing R is redirecting warning (and disabling `options("warn" = 2)`), we need to have a way to detect any new warning and fail
## How was this patch tested?
manual testing, Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15576 from felixcheung/rtestwarning.
## What changes were proposed in this pull request?
Fix for a bunch of test warnings that were added recently.
We need to investigate why warnings are not turning into errors.
```
Warnings -----------------------------------------------------------------------
1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length as column name
2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width as column name
3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length as column name
4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width as column name
Consider adding
importFrom("utils", "object.size")
to your NAMESPACE file.
```
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15560 from felixcheung/rwarnings.
## What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
I tested this on my MacBook. Following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```
## How was this patch tested?
* [x] Unit tests
Author: Hossein <hossein@databricks.com>
Closes#15375 from falaki/SPARK-17790.
## What changes were proposed in this pull request?
SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.
Actually we've already implemented static SQL conf implicitly via `SparkConf`, this PR just make it explicit and expose it to users, so that they can see the config value via SQL command or `SparkSession.conf`, and forbid users to set/unset static SQL conf.
## How was this patch tested?
new tests in SQLConfSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15295 from cloud-fan/global-conf.
## What changes were proposed in this pull request?
Fix SparkR ```spark.naiveBayes``` error when response variable of dataset is numeric type.
See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15431 from yanboliang/spark-15153-2.
## What changes were proposed in this pull request?
This PR includes the changes below:
- Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs
- Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`
## How was this patch tested?
Unit tests in `test_sparkSQL.R`/ `utils.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15239 from HyukjinKwon/SPARK-17665.
## What changes were proposed in this pull request?
`write.df`/`read.df` API require path which is not actually always necessary in Spark. Currently, it only affects the datasources implementing `CreatableRelationProvider`. Currently, Spark currently does not have internal data sources implementing this but it'd affect other external datasources.
In addition we'd be able to use this way in Spark's JDBC datasource after https://github.com/apache/spark/pull/12601 is merged.
**Before**
- `read.df`
```r
> read.df(source = "json")
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", :
argument "x" is missing with no default
```
```r
> read.df(path = c(1, 2))
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)", :
argument "x" is missing with no default
```
```r
> read.df(c(1, 2))
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
at
...
In if (is.na(object)) { :
...
```
- `write.df`
```r
> write.df(df, source = "json")
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
```
```r
> write.df(df, source = c(1, 2))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```
```r
> write.df(df, mode = TRUE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```
**After**
- `read.df`
```r
> read.df(source = "json")
Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
```
```r
> read.df(path = c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```
```r
> read.df(c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```
- `write.df`
```r
> write.df(df, source = "json")
Error in save : illegal argument - 'path' is not specified
```
```r
> write.df(df, source = c(1, 2))
Error in .local(df, path, ...) :
source should be charactor, null or omitted. It is 'parquet' by default.
```
```r
> write.df(df, mode = TRUE)
Error in .local(df, path, ...) :
mode should be charactor or omitted. It is 'error' by default.
```
## How was this patch tested?
Unit tests in `test_sparkSQL.R`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15231 from HyukjinKwon/write-default-r.
## What changes were proposed in this pull request?
Some tests in `test_mllib.r` are as below:
```r
expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
```
The problem is, `is.na` is internally called via `na.omit` in `spark.mlp` which causes warnings as below:
```
Warnings -----------------------------------------------------------------------
1. spark.mlp (test_mllib.R#400) - is.na() applied to non-(list or vector) of type 'NULL'
2. spark.mlp (test_mllib.R#401) - is.na() applied to non-(list or vector) of type 'NULL'
```
## How was this patch tested?
Manually tested. Also, Jenkins tests and AppVeyor.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15232 from HyukjinKwon/remove-warnnings.
## What changes were proposed in this pull request?
#15140 exposed ```JavaSparkContext.addFile(path: String, recursive: Boolean)``` to Python/R, then we can update SparkR ```spark.addFile``` to support adding directory recursively.
## How was this patch tested?
Added unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15216 from yanboliang/spark-17577-2.
## What changes were proposed in this pull request?
Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala).
```
if (args.isR && clusterManager == YARN) {
val sparkRPackagePath = RUtils.localSparkRPackagePath
if (sparkRPackagePath.isEmpty) {
printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
}
val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
if (!sparkRPackageFile.exists()) {
printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
}
val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
// Distribute the SparkR package.
// Assigns a symbol link name "sparkr" to the shipped package.
args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
// Distribute the R package archive containing all the built R packages.
if (!RUtils.rPackages.isEmpty) {
val rPackageFile =
RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
if (!rPackageFile.exists()) {
printErrorAndExit("Failed to zip all the built R packages.")
}
val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
// Assigns a symbol link name "rpkg" to the shipped package.
args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
}
}
```
So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor. Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster.
## How was this patch tested?
Verify it manually in R Studio using the following code.
```
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)
```
…
Author: Jeff Zhang <zjffdu@apache.org>
Closes#14784 from zjffdu/SPARK-17210.
## What changes were proposed in this pull request?
update `MultilayerPerceptronClassifierWrapper.fit` paramter type:
`layers: Array[Int]`
`seed: String`
update several default params in sparkR `spark.mlp`:
`tol` --> 1e-6
`stepSize` --> 0.03
`seed` --> NULL ( when seed == NULL, the scala-side wrapper regard it as a `null` value and the seed will use the default one )
r-side `seed` only support 32bit integer.
remove `layers` default value, and move it in front of those parameters with default value.
add `layers` parameter validation check.
## How was this patch tested?
tests added.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#15051 from WeichenXu123/update_py_mlp_default.
## What changes were proposed in this pull request?
When we build the docs separately we don't have the JAR files from the Spark build in
the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if JAR files don't exist
## How was this patch tested?
To test this we can run the following:
```
build/mvn -DskipTests -Psparkr clean
./R/create-docs.sh
```
You should see a line `Skipping R vignettes as Spark JARs not found` at the end
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#15200 from shivaram/sparkr-vignette-skip.
## What changes were proposed in this pull request?
#14881 added Kolmogorov-Smirnov Test wrapper to SparkR. I found that ```print.summary.KSTest``` was implemented inappropriately and result in no effect.
Running the following code for KSTest:
```Scala
data <- data.frame(test = c(0.1, 0.15, 0.2, 0.3, 0.25, -1, -0.5))
df <- createDataFrame(data)
testResult <- spark.kstest(df, "test", "norm")
summary(testResult)
```
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615016/b9a2823a-7d4f-11e6-934b-128beade355e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18615014/aafe2798-7d4f-11e6-8b99-c705bb9fe8f2.png)
The new implementation is similar with [```print.summary.GeneralizedLinearRegressionModel```](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L284) of SparkR and [```print.summary.glm```](https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R) of native R.
BTW, I removed the comparison of ```print.summary.KSTest``` in unit test, since it's only wrappers of the summary output which has been checked. Another reason is that these comparison will output summary information to the test console, it will make the test output in a mess.
## How was this patch tested?
Existing test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15139 from yanboliang/spark-17315.
## What changes were proposed in this pull request?
Scala/Python users can add files to Spark job by submit options ```--files``` or ```SparkContext.addFile()```. Meanwhile, users can get the added file by ```SparkFiles.get(filename)```.
We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by ```install.packages```.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15131 from yanboliang/spark-17577.
## What changes were proposed in this pull request?
Clarify that slide and window duration are absolute, and not relative to a calendar.
## How was this patch tested?
Doc build (no functional change)
Author: Sean Owen <sowen@cloudera.com>
Closes#15142 from srowen/SPARK-17297.
## What changes were proposed in this pull request?
Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#15075 from srowen/SPARK-17445.
## What changes were proposed in this pull request?
This PR tries to add a SparkR vignette, which works as a friendly guidance going through the functionality provided by SparkR.
## How was this patch tested?
Manual test.
Author: junyangq <qianjunyang@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Junyang Qian <junyangq@databricks.com>
Closes#14980 from junyangq/SPARKR-vignette.
## What changes were proposed in this pull request?
Fix summary() method's `return` description for spark.mlp
## How was this patch tested?
Ran tests locally on my laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#15015 from keypointt/SPARK-16445-2.
## What changes were proposed in this pull request?
SparkR ```spark.als``` arguments ```reg``` should be 0.1 by default, which need to be consistent with ML.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15021 from yanboliang/spark-17464.
## What changes were proposed in this pull request?
additional options were not passed down in write.df.
## How was this patch tested?
unit tests
falaki shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15010 from felixcheung/testreadoptions.
## What changes were proposed in this pull request?
Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.
cc shivaram
## How was this patch tested?
Unit tests
Author: Clark Fitzgerald <clarkfitzg@gmail.com>
Closes#14783 from clarkfitzg/SPARK-16785.
## What changes were proposed in this pull request?
This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution.
## How was this patch tested?
R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14881 from junyangq/SPARK-17315.
## What changes were proposed in this pull request?
This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users.
## How was this patch tested?
Manual test.
![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)
Author: Junyang Qian <junyangq@databricks.com>
Closes#14942 from junyangq/fixSparkRSessionDoc.
## What changes were proposed in this pull request?
Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.
If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.
The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce a INNER join
that will require a subsequent join condition.
That is df1.join(df2) is equivalent to select * from df1, df2.
## How was this patch tested?
Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.
Author: Srinath Shankar <srinath@databricks.com>
Closes#14866 from srinathshankar/crossjoin.
## What changes were proposed in this pull request?
change since version in doc
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14939 from felixcheung/rsparkversion2.
## What changes were proposed in this pull request?
Doc change - see https://issues.apache.org/jira/browse/SPARK-16324
## How was this patch tested?
manual check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14934 from felixcheung/regexpextractdoc.
## What changes were proposed in this pull request?
Add sparkR.version() API.
```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14935 from felixcheung/rsparksessionversion.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y:List of 5
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
..$ : num 2
The problem is that spark returns `decimal(10, 0)` col type, instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double".
As discussed in JIRA thread, we can have two potential fixes:
1). Scala side fix to add a new case when writing the object back; However, I can't use spark.sql.types._ in Spark core due to dependency issues. I don't find a way of doing type case match;
2). SparkR side fix: Add a helper function to check special type like `"decimal(10, 0)"` and replace it with `double`, which is PRIMITIVE type. This special helper is generic for adding new types handling in the future.
I open this PR to discuss pros and cons of both approaches. If we want to do Scala side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5")))
'data.frame': 5 obs. of 2 variables:
$ x: num 1 1 1 1 1
$ y: num 2 2 2 2 2
R Unit tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14613 from wangmiao1981/type.
https://issues.apache.org/jira/browse/SPARK-17241
## What changes were proposed in this pull request?
Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression.
## How was this patch tested?
Test manually on local laptop.
Author: Xin Ren <iamshrek@126.com>
Closes#14856 from keypointt/SPARK-17241.
## What changes were proposed in this pull request?
The usage in the original example is incorrect. This PR fixes it.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14903 from junyangq/SPARKR-FixWindowPartitionByDoc.
## What changes were proposed in this pull request?
Remove cleanup.jobj test. Use JVM wrapper API for other test cases.
## How was this patch tested?
Run R unit tests with testthat 1.0
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14904 from shivaram/sparkr-jvm-tests-fix.
## What changes were proposed in this pull request?
Currently, `HiveContext` in SparkR is not being tested and always skipped.
This is because the initiation of `TestHiveContext` is being failed due to trying to load non-existing data paths (test tables).
This is introduced from https://github.com/apache/spark/pull/14005
This enables the tests with SparkR.
## How was this patch tested?
Manually,
**Before** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
**After** (on Mac OS)
```
...
Skipped ------------------------------------------------------------------------
1. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
...
```
Please refer the tests below (on Windows)
- Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
- After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14889 from HyukjinKwon/SPARK-17326.
## What changes were proposed in this pull request?
This change exposes a public API in SparkR to create objects, call methods on the Spark driver JVM
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Unit tests, CRAN checks
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14775 from shivaram/sparkr-java-api.
## What changes were proposed in this pull request?
This PR tries to fix the name of the `SparkDataFrame` used in the example. Also, it gives a reference url of an example data file so that users can play with.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14853 from junyangq/SPARKR-FixLDADoc.
## What changes were proposed in this pull request?
The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14820 from junyangq/SPARK-FixNaiveBayes.
## What changes were proposed in this pull request?
This PR gives informative message to users when they try to connect to a remote master but don't have Spark package in their local machine.
As a clarification, for now, automatic installation will only happen if they start SparkR in R console (rather than from sparkr-shell) and connect to local master. In the remote master mode, local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we here try to provide detailed message that may help the users.
Some of the other messages have also been slightly changed.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14761 from junyangq/SPARK-16579-V1.
## What changes were proposed in this pull request?
This PR adds more examples to window function docs to make them more accessible to the users.
It also fixes default value issues for `lag` and `lead`.
## How was this patch tested?
Manual test, R unit test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14779 from junyangq/SPARKR-FixWindowFunctionDocs.
## What changes were proposed in this pull request?
Fixed several misplaced param tag - they should be on the spark.* method generics
## How was this patch tested?
run knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14792 from felixcheung/rdocmllib.
https://issues.apache.org/jira/browse/SPARK-16445
## What changes were proposed in this pull request?
Create Multilayer Perceptron Classifier wrapper in SparkR
## How was this patch tested?
Tested manually on local machine
Author: Xin Ren <iamshrek@126.com>
Closes#14447 from keypointt/SPARK-16445.
## What changes were proposed in this pull request?
The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.
## How was this patch tested?
Manual test.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14776 from junyangq/SPARK-FixShowDoc.
## What changes were proposed in this pull request?
The PR removes reference link in the doc for environment variables for common Windows folders. The cran check gave code 503: service unavailable on the original link.
## How was this patch tested?
Manual check.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14767 from junyangq/SPARKR-RemoveLink.
## What changes were proposed in this pull request?
Update DESCRIPTION
## How was this patch tested?
Run install and CRAN tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14764 from felixcheung/rpackagedescription.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14759 from shivaram/sparkr-cran-jenkins.
## What changes were proposed in this pull request?
replace ``` ` ``` in code doc with `\code{thing}`
remove added `...` for drop(DataFrame)
fix remaining CRAN check warnings
## How was this patch tested?
create doc with knitr
junyangq
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14734 from felixcheung/rdoccleanup.
## What changes were proposed in this pull request?
This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14758 from shivaram/sparkr-maintainers.
## What changes were proposed in this pull request?
refactor, cleanup, reformat, fix deprecation in test
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14735 from felixcheung/rmllibutil.
## What changes were proposed in this pull request?
This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.
## How was this patch tested?
Manual test in Windows 7.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14743 from junyangq/SPARKR-FixWindowsInstall.
## What changes were proposed in this pull request?
Ignore temp files generated by `check-cran.sh`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#14740 from mengxr/R-gitignore.
## What changes were proposed in this pull request?
#14551 fixed off-by-one bug in ```randomizeInPlace``` and some test failure caused by this fix.
But for SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the output result of native R which should be compared by SparkR, however, it did not change the R code in annotation which is used for reproducing the result in native R. It will confuse users who can not reproduce the same result in native R. This PR sends a more robust test case which can produce same result between SparkR and native R.
## How was this patch tested?
Unit test update.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14730 from yanboliang/spark-16961-followup.
## What changes were proposed in this pull request?
This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.
One left is doc for R `stats::glm` exported in SparkR. To mute that warning, we have to also provide document for all arguments of that non-SparkR function.
Some previous conversation is in #14558.
## How was this patch tested?
R unit test and `check-cran.sh` script (with no-test).
Author: Junyang Qian <junyangq@databricks.com>
Closes#14705 from junyangq/SPARK-16508-master.
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961
Changed one line of Utils.randomizeInPlace to allow elements to stay in place.
Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.
Author: Nick Lavers <nick.lavers@videoamp.com>
Closes#14551 from nicklavers/SPARK-16961-randomizeInPlace.
## What changes were proposed in this pull request?
Add LDA Wrapper in SparkR with the following interfaces:
- spark.lda(data, ...)
- spark.posterior(object, newData, ...)
- spark.perplexity(object, ...)
- summary(object)
- write.ml(object)
- read.ml(path)
## How was this patch tested?
Test with SparkR unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#14229 from yinxusen/SPARK-16447.
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14392 from yanboliang/spark-16446.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Add Isotonic Regression wrapper in SparkR
Wrappers in R and Scala are added.
Unit tests
Documentation
## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#14182 from wangmiao1981/isoR.
## What changes were proposed in this pull request?
Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions are sharing generics with DataFrame functions (hence the problem) so after the renames we need to add new generics, for now.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14626 from felixcheung/rrddfunctions.
## What changes were proposed in this pull request?
Fix the issue that ```spark.glm``` ```weightCol``` should in the signature.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14641 from yanboliang/weightCol.
## What changes were proposed in this pull request?
Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.
Updates:
Several changes have been made:
- `install.spark()`
- check existence of tar file in the cache folder, and download only if not found
- trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
- use 2.0.0
- `sparkR.session()`
- can install spark when not found in `SPARK_HOME`
## How was this patch tested?
Manual tests, running the check-cran.sh script added in #14173.
Author: Junyang Qian <junyangq@databricks.com>
Closes#14258 from junyangq/SPARK-16579.
## What changes were proposed in this pull request?
Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14346 from yanboliang/spark-16710.
## What changes were proposed in this pull request?
This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
## How was this patch tested?
SparkR unit tests, SparkSubmitSuite, check-cran.sh
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14243 from shivaram/sparkr-jar-move.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
sparkPackages - argument is passed and we detect that we are in the R script mode, we should print some warning like --packages flag should be used with with spark-submit
## How was this patch tested?
In my system locally
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#14179 from krishnakalyan3/spark-pkg.
## What changes were proposed in this pull request?
Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14177 from felixcheung/rsessiontest.
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14173 from shivaram/sparkr-cran-changes.
## What changes were proposed in this pull request?
More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.
## How was this patch tested?
unit tests
shivaram dongjoon-hyun
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14206 from felixcheung/rroutetests.
## What changes were proposed in this pull request?
Fix function routing to work with and without namespace operator `SparkR::createDataFrame`
## How was this patch tested?
manual, unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14195 from felixcheung/rroutedefault.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#14192 from sun-rui/SPARK-16509.
## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init"
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14178 from felixcheung/rcsvprogrammingguide.
## What changes were proposed in this pull request?
Minor example updates
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#14171 from felixcheung/rexample.
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, so that we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.
## How was this patch tested?
Only docs update, manually check the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14011 from yanboliang/r-user-guide-update.
## What changes were proposed in this pull request?
This PR prevents ERRORs when `summary(df)` is called for `SparkDataFrame` with not-numeric columns. This failure happens only in `SparkR`.
**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```
**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```
## How was this patch tested?
Pass the Jenkins with a updated testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14096 from dongjoon-hyun/SPARK-16425.
## What changes were proposed in this pull request?
Apply default "NA" as null string for R, like R read.csv na.string parameter.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"
An user passing a csv file with NA value should get the same behavior with SparkR read.df(... source = "csv")
(couldn't open JIRA, will do that later)
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13984 from felixcheung/rcsvnastring.
## What changes were proposed in this pull request?
ORC test should be enabled only when HiveContext is available.
## How was this patch tested?
Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
DONE ===========================================================================
Tests passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14019 from dongjoon-hyun/SPARK-16233.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes#13975 from sun-rui/SPARK-16299.
## What changes were proposed in this pull request?
gapplyCollect() does gapply() on a SparkDataFrame and collect the result back to R. Compared to gapply() + collect(), gapplyCollect() offers performance optimization as well as programming convenience, as no schema is needed to be provided.
This is similar to dapplyCollect().
## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect
Author: Narine Kokhlikyan <narine@slice.com>
Closes#13760 from NarineK/gapplyCollect.
## What changes were proposed in this pull request?
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
https://issues.apache.org/jira/browse/SPARK-16140
## What changes were proposed in this pull request?
Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.
## How was this patch tested?
Tested on my local machine
And on my laptop `jekyll build` is failing to build API docs, so here I can only show you the html I manually generated from Rd files, with no CSS applied, but the doc content should be there.
![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)
Author: Xin Ren <iamshrek@126.com>
Closes#13921 from keypointt/SPARK-16140.
## What changes were proposed in this pull request?
Add unit tests for csv data for SPARKR
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13904 from felixcheung/rcsv.
## What changes were proposed in this pull request?
update sparkR DataFrame.R comment
SQLContext ==> SparkSession
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13946 from WeichenXu123/sparkR_comment_update_sparkSession.
## What changes were proposed in this pull request?
Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise.
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#13839 from ScrapCodes/add_truncateTo_DF.show.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13838 from felixcheung/rjobgroup.
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes#13660 from vectorijk/spark-15672-R-guide-update.