## What changes were proposed in this pull request?
fix the random seed to eliminate variability
## How was this patch tested?
jenkins, appveyor, lots more jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19018 from felixcheung/rrftest.
## What changes were proposed in this pull request?
Code in vignettes requires winutils on windows to run, when publishing to CRAN or building from source, winutils might not be available, so it's better to disable code run (so resulting vigenttes will not have output from code, but text is still there and code is still there)
fix * checking re-building of vignette outputs ... WARNING
and
> %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
## How was this patch tested?
jenkins, appveyor, r-hub
before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#19016 from felixcheung/rvigwind.
## What changes were proposed in this pull request?
SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.
This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.
## How was this patch tested?
Modified and additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#18786 from aray/summary-r.
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18831 from actuaryzhang/sparkROffset.
## What changes were proposed in this pull request?
SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.
This is a followup PR for SPARK-20307.
## How was this patch tested?
New Unit tests are added.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.
## How was this patch tested?
Add test cases.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18613 from yanboliang/spark-20307.
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes#18625 from srowen/SPARK-21267.2.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.
**Python**
`from_json`
```python
from pyspark.sql.functions import from_json
data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```
**R**
`from_json`
```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```
`structType.character`
```R
structType("a STRING, b INT")
```
`dapply`
```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```
`gapply`
```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```
## How was this patch tested?
Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18498 from HyukjinKwon/SPARK-21266.
## What changes were proposed in this pull request?
This is a retry for #18320. This PR was reverted due to unexpected test failures with -10 error code.
I was unable to reproduce in MacOS, CentOS and Ubuntu but only in Jenkins. So, the tests proceeded to verify this and revert the past try here - https://github.com/apache/spark/pull/18456
This new approach was tested in https://github.com/apache/spark/pull/18463.
**Test results**:
- With the part of suspicious change in the past try (466325d3fd)
Tests ran 4 times and 2 times passed and 2 time failed.
- Without the part of suspicious change in the past try (466325d3fd)
Tests ran 5 times and they all passed.
- With this new approach (0a7589c09f)
Tests ran 5 times and they all passed.
It looks the cause is as below (see 466325d3fd):
```diff
+ exitCode <- 1
...
+ data <- parallel:::readChild(child)
+ if (is.raw(data)) {
+ if (unserialize(data) == exitCode) {
...
+ }
+ }
...
- parallel:::mcexit(0L)
+ parallel:::mcexit(0L, send = exitCode)
```
Two possibilities I think
- `parallel:::mcexit(.. , send = exitCode)`
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> It sends send to the master (unless NULL) and then shuts down the child process.
However, it looks possible that the parent attemps to terminate the child right after getting our custom exit code. So, the child gets terminated between "send" and "shuts down", failing to exit properly.
- A bug between `parallel:::mcexit(..., send = ...)` and `parallel:::readChild`.
**Proposal**:
To resolve this, I simply decided to avoid both possibilities with this new approach here (9ff89a7859). To support this idea, I explained with some quotation of the documentation as below:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> `readChild` and `readChildren` return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated or `NULL` if the child no longer exists (no children at all for `readChildren`).
`readChild` returns "an integer vector of length one with the process ID if a child terminated" so we can check if it is `integer` and the same selected "process ID". I believe this makes sure that the children are exited.
In case that children happen to send any data manually to parent (which is why we introduced the suspicious part of the change (466325d3fd)), this should be raw bytes and will be discarded (and then will try to read the next and check if it is `integer` in the next loop).
## How was this patch tested?
Manual tests and Jenkins tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18465 from HyukjinKwon/SPARK-21093-retry-1.
## What changes were proposed in this pull request?
This adds documentation to many functions in pyspark.sql.functions.py:
`upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
Add units to the trigonometry functions.
Renames columns in datetime examples to be more informative.
Adds links between some functions.
## How was this patch tested?
`./dev/lint-python`
`python python/pyspark/sql/functions.py`
`./python/run-tests.py --module pyspark-sql`
Author: Michael Patterson <map222@gmail.com>
Closes#17865 from map222/spark-20456.
## What changes were proposed in this pull request?
For randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch add a new method to set the underlying StringIndexer handleInvalid logic.
This patch should also apply to other classifiers. This PR focuses on the main logic and randomForest classifier. I will do follow-up PR for other classifiers.
## How was this patch tested?
Add a new unit test based on the error case in the JIRA.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18496 from wangmiao1981/handle.
## What changes were proposed in this pull request?
Add doc for methods that were left out, and fix various style and consistency issues.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18493 from actuaryzhang/sparkRDocCleanup.
## What changes were proposed in this pull request?
Grouped documentation for column window methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18481 from actuaryzhang/sparkRDocWindow.
## What changes were proposed in this pull request?
Grouped documentation for column collection methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18458 from actuaryzhang/sparkRDocCollection.
## What changes were proposed in this pull request?
Grouped documentation for column misc methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18448 from actuaryzhang/sparkRDocMisc.
## What changes were proposed in this pull request?
Grouped documentation for nonaggregate column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18422 from actuaryzhang/sparkRDocNonAgg.
## What changes were proposed in this pull request?
This PR proposes to support a DDL-formetted string as schema as below:
```r
mockLines <- c("{\"name\":\"Michael\"}",
"{\"name\":\"Andy\", \"age\":30}",
"{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```
## How was this patch tested?
Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18431 from HyukjinKwon/r-ddl-schema.
## What changes were proposed in this pull request?
Grouped documentation for string column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18366 from actuaryzhang/sparkRDocString.
## What changes were proposed in this pull request?
Grouped documentation for math column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18371 from actuaryzhang/sparkRDocMath.
## What changes were proposed in this pull request?
`mcfork` in R looks opening a pipe ahead but the existing logic does not properly close it when it is executed hot. This leads to the failure of more forking due to the limit for number of files open.
This hot execution looks particularly for `gapply`/`gapplyCollect`. For unknown reason, this happens more easily in CentOS and could be reproduced in Mac too.
All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
## How was this patch tested?
I ran the codes below on both CentOS and Mac with that configuration disabled/enabled.
```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
... # 30 times
```
Also, now it passes R tests on CentOS as below:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18320 from HyukjinKwon/SPARK-21093.
## What changes were proposed in this pull request?
Extend `setJobDescription` to SparkR API.
## How was this patch tested?
It looks difficult to add a test. Manually tested as below:
```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```
prints ...
![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18382 from HyukjinKwon/SPARK-21149.
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18114 from actuaryzhang/sparkRDocDate.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.
## How was this patch tested?
Add new unit tests.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18128 from wangmiao1981/test.
## What changes were proposed in this pull request?
Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18140 from actuaryzhang/sparkRFormula.
## What changes were proposed in this pull request?
LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs.
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18025 from actuaryzhang/sparkRDoc4.
## What changes were proposed in this pull request?
Add SQL trunc function
## How was this patch tested?
standard test
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18291 from actuaryzhang/sparkRTrunc2.
## What changes were proposed in this pull request?
This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying.
## How was this patch tested?
Manually running multiple times R tests via `./R/run-tests.sh`.
**Before**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................
Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
DONE ===========================================================================
```
**After**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18335 from HyukjinKwon/SPARK-21128.
## What changes were proposed in this pull request?
Update Running R Tests dependence packages to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18271 from wangyum/building-spark.
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18202 from gatorsmile/renameCVSOption.
## What changes were proposed in this pull request?
clean up after big test move
## How was this patch tested?
unit tests, jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18267 from felixcheung/rtestset2.
## What changes were proposed in this pull request?
Move all existing tests to non-installed directory so that it will never run by installing SparkR package
For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (if anyone has suggestion)
It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them)
## How was this patch tested?
- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18264 from felixcheung/rtestset.
## What changes were proposed in this pull request?
Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users.
## How was this patch tested?
N/A - doc only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#18256 from rxin/SPARK-21042.
## What changes were proposed in this pull request?
to investigate how long they run
## How was this patch tested?
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18104 from felixcheung/rtimetest.
## What changes were proposed in this pull request?
1, add an example for sparkr `decisionTree`
2, document it in user guide
## How was this patch tested?
local submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#18067 from zhengruifeng/dt_example.
## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18035 from yanboliang/svm-r.
## What changes were proposed in this pull request?
This change skips tests that use the Hadoop libraries while running
on CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are missing on the target
system. The skipped tests consist of
1. Tests that save, load a model in MLlib
2. Tests that save, load CSV, JSON and Parquet files in SQL
3. Hive tests
## How was this patch tested?
Tested by running on a local windows VM with HADOOP_HOME unset. Also testing with https://win-builder.r-project.org
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#17966 from shivaram/sparkr-windows-cran.
## What changes were proposed in this pull request?
support decision tree in R
## How was this patch tested?
added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17981 from zhengruifeng/dt_r.
## What changes were proposed in this pull request?
Some examples in the DataFrame methods are syntactically wrong, even though they are pseudo code. Fix these and some style issues.
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#18003 from actuaryzhang/sparkRDoc3.
## What changes were proposed in this pull request?
Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples.
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17988 from zero323/cube-docs.
## What changes were proposed in this pull request?
- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.
## How was this patch tested?
Unit tests, check `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17965 from zero323/SPARK-20726.
## What changes were proposed in this pull request?
This PR proposes three things as below:
- Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
- Support single argument for `to_timestamp` similarly with APIs in other languages.
For example, the one below works
```
import org.apache.spark.sql.functions._
Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
```
prints
```
+----------------------------------------+
|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
+----------------------------------------+
| 2016-12-31 00:12:00|
+----------------------------------------+
```
whereas this does not work in SQL.
**Before**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
```
**After**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
```
- Related document improvement for SQL function descriptions and other API descriptions accordingly.
**Before**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00.0
```
**After**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage:
to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
a date. Returns null with invalid input. By default, it follows casting rules to a date if
the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage:
to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
a timestamp. Returns null with invalid input. By default, it follows casting rules to
a timestamp if the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00
```
## How was this patch tested?
Added tests in `datetime.sql`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17901 from HyukjinKwon/to_timestamp_arg.
## What changes were proposed in this pull request?
- [x] need to test by running R CMD check --as-cran
- [x] sanity check vignettes
## How was this patch tested?
Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17945 from felixcheung/rchangesforpackage.
## What changes were proposed in this pull request?
Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs
## How was this patch tested?
unit tests, this needs to combine with another commit with SQL change to check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17905 from felixcheung/rtabletests.
## What changes were proposed in this pull request?
Cleaning existing temp tables before running tableNames tests
## How was this patch tested?
SparkR Unit tests
Author: Hossein <hossein@databricks.com>
Closes#17903 from falaki/SPARK-20661.
## What changes were proposed in this pull request?
Fix typo in vignettes
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#17884 from actuaryzhang/typo.
## What changes were proposed in this pull request?
set timezone on windows
## How was this patch tested?
unit test, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17892 from felixcheung/rtimestamptest.
## What changes were proposed in this pull request?
- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).
## How was this patch tested?
Unit tests, `check_cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17825 from zero323/SPARK-20550.
## What changes were proposed in this pull request?
add environment
## How was this patch tested?
wait for appveyor run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17878 from felixcheung/appveyorrcran.
## What changes were proposed in this pull request?
Make tests more reliable by having it till processed.
Increasing timeout value might help but ultimately the flakiness from processing delay when Jenkins is hard to account for. This isn't an actual public API supported
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17857 from felixcheung/rsstestrelia.
## What changes were proposed in this pull request?
Adds wrapper for `o.a.s.sql.functions.input_file_name`
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17818 from zero323/SPARK-20544.
## What changes were proposed in this pull request?
Adds support for generic hints on `SparkDataFrame`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17851 from zero323/SPARK-20585.
## What changes were proposed in this pull request?
Add
- R vignettes
- R programming guide
- SS programming guide
- R example
Also disable spark.als in vignettes for now since it's failing (SPARK-20402)
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17814 from felixcheung/rdocss.
## What changes were proposed in this pull request?
General rule on skip or not:
skip if
- RDD tests
- tests could run long or complicated (streaming, hivecontext)
- tests on error conditions
- tests won't likely change/break
## How was this patch tested?
unit tests, `R CMD check --as-cran`, `R CMD check`
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17817 from felixcheung/rskiptest.
## What changes were proposed in this pull request?
doc only
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17828 from felixcheung/rnotfamily.
## What changes were proposed in this pull request?
Adds R wrappers for:
- `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shading `base::grouping`
- `o.a.s.sql.functions.grouping_id`
## How was this patch tested?
Existing unit tests, additional unit tests. `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17807 from zero323/SPARK-20532.
## What changes were proposed in this pull request?
Add without param for timeout - will need this to submit a job that runs until stopped
Need this for 2.2
## How was this patch tested?
manually, unit test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17815 from felixcheung/rssawaitinfinite.
## What changes were proposed in this pull request?
- Add null-safe equality operator `%<=>%` (sames as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
- Add boolean negation operator `!` and function `not `.
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17783 from zero323/SPARK-20490.
## What changes were proposed in this pull request?
Ad R wrappers for
- `o.a.s.sql.functions.explode_outer`
- `o.a.s.sql.functions.posexplode_outer`
## How was this patch tested?
Additional unit tests, manual testing.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17809 from zero323/SPARK-20535.
## What changes were proposed in this pull request?
It seems we are using `SQLUtils.getSQLDataType` for type string in structField. It looks we can replace this with `CatalystSqlParser.parseDataType`.
They look similar DDL-like type definitions as below:
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
Such type strings looks identical when R’s one as below:
```R
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
struct
1 a
```
R’s one is stricter because we are checking the types via regular expressions in R side ahead.
Actual logics there look a bit different but as we check it ahead in R side, it looks replacing it would not introduce (I think) no behaviour changes. To make this sure, the tests dedicated for it were added in SPARK-20105. (It looks `structField` is the only place that calls this method).
## How was this patch tested?
Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17785 from HyukjinKwon/SPARK-20493.
## What changes were proposed in this pull request?
Replace
note repeat_string 2.3.0
with
note repeat_string since 2.3.0
## How was this patch tested?
`create-docs.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17779 from zero323/REPEAT-NOTE.
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17757 from yanboliang/flaky-test.
## What changes were proposed in this pull request?
- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.
## How was this patch tested?
- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17728 from zero323/SPARK-20437.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
Add wrappers for `o.a.s.sql.functions`:
- `split` as `split_string`
- `repeat` as `repeat_string`
## How was this patch tested?
Existing tests, additional unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17729 from zero323/SPARK-20438.
## What changes were proposed in this pull request?
Adds wrappers for `collect_list` and `collect_set`.
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17672 from zero323/SPARK-20371.
## What changes were proposed in this pull request?
Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17674 from zero323/SPARK-20375.
## What changes were proposed in this pull request?
Checking a source parameter is asynchronous. When the query is created, it's not guaranteed that source has been created. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17687 from zsxwing/SPARK-20397.
## What changes were proposed in this pull request?
Document fpGrowth in:
- vignettes
- programming guide
- code example
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17557 from zero323/SPARK-20208.
## What changes were proposed in this pull request?
This was suggested to be `as.json.array` at the first place in the PR to SPARK-19828 but we could not do this as the lint check emits an error for multiple dots in the variable names.
After SPARK-20278, now we are able to use `multiple.dots.in.names`. `asJsonArray` in `from_json` function is still able to be changed as 2.2 is not released yet.
So, this PR proposes to rename `asJsonArray` to `as.json.array`.
## How was this patch tested?
Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17653 from HyukjinKwon/SPARK-19828-followup.
## What changes were proposed in this pull request?
Currently, multi-dot separated variables in R is not allowed. For example,
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
produces an error as below:
```
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says
> The preferred form for variable names is all lower case letters and words separated with dots
This looks because lintr by default https://github.com/jimhester/lintr follows http://r-pkgs.had.co.nz/style.html as written in the README.md. Few cases seems not following Google's one as "a few tweaks".
Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43
Also, it looks we have no limit on function name. This rule also looks affecting to the name of functions as written in the README.md.
> `multiple_dots_linter`: check that function and variable names are separated by _ rather than ..
## How was this patch tested?
Manually tested `./dev/lint-r`with the manual change below in `R/functions.R`:
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
**Before**
```R
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
**After**
```
lintr checks passed.
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17590 from HyukjinKwon/disable-dot-in-name.
## What changes were proposed in this pull request?
Fixed spelling of "charactor"
## How was this patch tested?
Spelling change only
Author: Brendan Dwyer <brendan.dwyer@ibm.com>
Closes#17611 from bdwyer2/SPARK-20298.
## What changes were proposed in this pull request?
Test failed because SPARK_HOME is not set before Spark is installed.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17516 from felixcheung/rdircheckincran.
## What changes were proposed in this pull request?
Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17511 from felixcheung/rceatetable.
## What changes were proposed in this pull request?
Update doc to remove external for createTable, add refreshByPath in python
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17512 from felixcheung/catalogdoc.
## What changes were proposed in this pull request?
minor update
zero323
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17526 from felixcheung/rfpgrowthfollowup.
## What changes were proposed in this pull request?
It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields.
This PR proposes to fix `catalog.R`'s order so that running this script does not show up a small diff in this file every time.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17528 from HyukjinKwon/minor-reorder-description.
## What changes were proposed in this pull request?
Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
- `spark.fpGrowth` -model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r. FPGrowthWrapper`
- unit tests.
## How was this patch tested?
Feature specific unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17170 from zero323/SPARK-19825.
## What changes were proposed in this pull request?
Add a set of catalog API in R
```
"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",
```
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17483 from felixcheung/rcatalog.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
## What changes were proposed in this pull request?
If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed.
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#17452 from zuotingbing/spark-bulid.
## What changes were proposed in this pull request?
It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)) but resembles SQL type definition.
Therefore, it seems nicer if we test positive/negative cases in R side.
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17439 from HyukjinKwon/r-typestring-tests.
## What changes were proposed in this pull request?
This PR proposes to match minor documentations changes in https://github.com/apache/spark/pull/17399 and https://github.com/apache/spark/pull/17380 to R/Python.
## How was this patch tested?
Manual tests in Python , Python tests via `./python/run-tests.py --module=pyspark-sql` and lint-checks for Python/R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17429 from HyukjinKwon/minor-match-doc.
## What changes were proposed in this pull request?
SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
## How was this patch tested?
Add unit tests, and verify this fix at standalone and yarn cluster.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17274 from yanboliang/spark-19925.
## What changes were proposed in this pull request?
When SparkR is installed as a R package there might not be any java runtime.
If it is not there SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell, without any notification or message.
## How was this patch tested?
manually
- [x] need to test on Windows
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16596 from felixcheung/rcheckjava.
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.
## How was this patch tested?
existing tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17369 from zhengruifeng/doc_quantiles_nan.
## What changes were proposed in this pull request?
Currently JSON and CSV have exactly the same logic about handling bad records, this PR tries to abstract it and put it in a upper level to reduce code duplication.
The overall idea is, we make the JSON and CSV parser to throw a BadRecordException, then the upper level, FailureSafeParser, handles bad records according to the parse mode.
Behavior changes:
1. with PERMISSIVE mode, if the number of tokens doesn't match the schema, previously CSV parser will treat it as a legal record and parse as many tokens as possible. After this PR, we treat it as an illegal record, and put the raw record string in a special column, but we still parse as many tokens as possible.
2. all logging is removed as they are not very useful in practice.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>
Closes#17315 from cloud-fan/bad-record2.
## What changes were proposed in this pull request?
doc only change
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17356 from felixcheung/rdfcheckpoint2.
## What changes were proposed in this pull request?
Add checkpoint, setCheckpointDir API to R
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17351 from felixcheung/rdfcheckpoint.
## What changes were proposed in this pull request?
This PR proposes to support an array of struct type in `to_json` as below:
```scala
import org.apache.spark.sql.functions._
val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```
```
+----------+
| json|
+----------+
|[{"_1":1}]|
+----------+
```
Currently, it throws an exception as below (a newline manually inserted for readability):
```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
mismatch: structtojson requires that the expression is a struct expression.;;
```
This allows the roundtrip with `from_json` as below:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()
// Read back.
df.select(to_json($"array").as("json")).show()
```
```
+----------+
| array|
+----------+
|[[1], [2]]|
+----------+
+-----------------+
| json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```
Also, this PR proposes to rename from `StructToJson` to `StructsToJson ` and `JsonToStruct` to `JsonToStructs`.
## How was this patch tested?
Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17192 from HyukjinKwon/SPARK-19849.
## What changes were proposed in this pull request?
Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log
## How was this patch tested?
Manually, unit tests
With this, these are relocated to under /tmp
```
# ls /tmp/RtmpG2M0cB/
derby.log
```
And they are removed automatically when the R session is ended.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16330 from felixcheung/rderby.
## What changes were proposed in this pull request?
It seems cran check scripts corrects `R/pkg/DESCRIPTION` and follows the order in `Collate` fields.
This PR proposes to fix this so that running this script does not show up a diff in this file.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17349 from HyukjinKwon/minor-cran.
## What changes were proposed in this pull request?
Add "experimental" API for SS in R
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16982 from felixcheung/rss.
## What changes were proposed in this pull request?
Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField` as below:
```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
```
prints
```R
arrcol
1 Bob, Alice
```
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17178 from HyukjinKwon/SPARK-19828.
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR
felixcheung yanboliang
## How was this patch tested?
new test in SparkR
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16729 from actuaryzhang/sparkRTweedie.