## What changes were proposed in this pull request?
Remove Spark if it was downloaded and installed by the package.
## How was this patch tested?
Manually, by building the package.
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19657 from felixcheung/rinstalldir.
## What changes were proposed in this pull request?
This PR proposes to add `errorifexists` to the SparkR API and to fix the remaining descriptions of the write mode, mainly in the API documentation.
This PR also replaces `convertToJSaveMode` with `setWriteMode` so that the mode string is passed as-is to the JVM and executes:
b034f2565f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (L72-L82)
and removes the duplication here:
3f958a9992/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L187-L194)
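A rough sketch of the direction (the helper name `setWriteMode` comes from the description above; the body is an assumption, not the exact patch):
```r
# Hedged sketch: validate that the mode is a string and pass it as-is to
# the JVM DataFrameWriter, which performs the actual mode validation.
setWriteMode <- function(write, mode) {
  if (!is.character(mode)) {
    stop("mode should be a character string, ",
         "e.g. 'append', 'overwrite', 'error', 'errorifexists' or 'ignore'.")
  }
  callJMethod(write, "mode", mode)
}
```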
## How was this patch tested?
Manually checked the built documentation. These were mainly found by `` grep -r `error` `` and `grep -r 'error'`.
Also, unit tests added in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19673 from HyukjinKwon/SPARK-21640-followup.
## What changes were proposed in this pull request?
This is to fix the code for the latest R changes in R-devel when running the CRAN check:
```
checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
  Code: function(what, pos = 2L, name = deparse(substitute(what),
                 backtick = FALSE), warn.conflicts = TRUE)
  Docs: function(what, pos = 2L, name = deparse(substitute(what)),
                 warn.conflicts = TRUE)
  Mismatches in argument default values:
    Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))

Codoc mismatches from documentation object 'glm':
glm
  Code: function(formula, family = gaussian, data, weights, subset,
                 na.action, start = NULL, etastart, mustart, offset,
                 control = list(...), model = TRUE, method = "glm.fit",
                 x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
                 NULL, ...)
  Docs: function(formula, family = gaussian, data, weights, subset,
                 na.action, start = NULL, etastart, mustart, offset,
                 control = list(...), model = TRUE, method = "glm.fit",
                 x = FALSE, y = TRUE, contrasts = NULL, ...)
  Argument names in code not in docs:
    singular.ok
  Mismatches in argument names:
    Position: 16 Code: singular.ok Docs: contrasts
    Position: 17 Code: contrasts Docs: ...
```
With `attach`, it's pulling in the function definition from `base::attach`. We need to disable that, but we would still need a function signature for roxygen2 to build with.
With `glm`, it's pulling in the function definition (i.e. "usage") from the `stats::glm` function. Since this is "compiled in" when we build the source package into the .Rd file, it won't match the latest signature when that changes at runtime or in the CRAN check. The solution is not to pull in from `stats::glm`, since there isn't much value in doing that (we don't actually use any of its params, and the ones we do use are explicitly documented).
Also, with `attach` we are changing the code to build the call dynamically, as sketched below.
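A hedged illustration of the dynamic-call idea (names are assumptions, not the exact patch): building the call with `do.call` avoids baking base R's default arguments into SparkR's source.
```r
# Hedged sketch: forward to base::attach at runtime so that a change in
# its defaults (such as the new backtick argument in R-devel) cannot
# produce a codoc mismatch against SparkR's compiled-in .Rd usage.
attachInternal <- function(what, pos, name, warn.conflicts) {
  do.call("attach", list(what, pos = pos, name = name,
                         warn.conflicts = warn.conflicts))
}
```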
## How was this patch tested?
Manually.
- [x] check documentation output - yes
- [x] check help `?attach` `?glm` - yes
- [x] check on other platforms: r-hub, r-devel, etc.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19557 from felixcheung/rattachglmdocerror.
## What changes were proposed in this pull request?
This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have a R package downloaded from CRAN and are using that to connect to an existing Spark cluster.
This is raised as a warning rather than an error, as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2, etc.).
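A minimal sketch of such a check (the exact call site and message wording are assumptions; `sparkR.version()` reports the JVM side):
```r
# Hedged sketch: warn, rather than error, on a version mismatch between
# the local SparkR package and the Spark version on the connected JVM.
pkgVersion <- as.character(packageVersion("SparkR"))
jvmVersion <- sparkR.version()
if (jvmVersion != pkgVersion) {
  warning("Version mismatch between Spark JVM and SparkR package. JVM version was ",
          jvmVersion, ", while R package version was ", pkgVersion)
}
```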
## How was this patch tested?
Manually by changing the `DESCRIPTION` file
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #19624 from shivaram/sparkr-version-check.
## What changes were proposed in this pull request?
This PR sets `java.io.tmpdir` for CRAN checks and also disables hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`.
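A hedged sketch of the relevant JVM options (how they are wired into the CRAN check harness is an assumption):
```r
# Hedged sketch: point java.io.tmpdir at a session-local directory and
# disable hsperfdata (-XX:-UsePerfData) so the JVM leaves nothing in /tmp.
tmpDir <- tempdir()
javaOpts <- paste0("-Djava.io.tmpdir=", tmpDir, " -XX:-UsePerfData")
sparkR.session(sparkConfig = list(
  spark.driver.extraJavaOptions = javaOpts
))
```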
## How was this patch tested?
Tested manually on a clean EC2 machine
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #19589 from shivaram/sparkr-tmpdir-clean.
## What changes were proposed in this pull request?
This PR proposes to revive the `stringsAsFactors` option in the collect API, which was mistakenly removed in 71a138cd0e.
Simply, it casts `character` to `factor` when the condition `stringsAsFactors && is.character(vec)` holds during primitive type conversion.
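A small usage sketch of the revived option:
```r
df <- createDataFrame(data.frame(x = c("a", "b", "a")))
str(collect(df, stringsAsFactors = TRUE)$x)
# Factor w/ 2 levels "a","b": 1 2 1
```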
## How was this patch tested?
Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19551 from HyukjinKwon/SPARK-17902.
## What changes were proposed in this pull request?
Currently, `percentile_approx` never returns the first element when the percentile is in (relativeError, 1/N], where `relativeError` defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
For example, given the input data 1 to 10, if a user queries the 10% (or even smaller) percentile, it should return 1, because the first value already covers 10%. Currently it returns 2.
Per the paper, `targetError` should not be rounded up, and the search index should start from 0 instead of 1. Following the paper fixes the cases mentioned above, as in the sketch below.
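A sketch of the example above through SQL (after the fix, the first element is returned):
```r
df <- createDataFrame(data.frame(v = 1:10))
createOrReplaceTempView(df, "t")
# The first value already covers the 10% percentile, so this yields 1
# (it yielded 2 before this fix).
head(sql("SELECT percentile_approx(v, 0.1) FROM t"))
```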
## How was this patch tested?
Added a new test case and fixed existing test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #19438 from wzhfy/improve_percentile_approx.
## What changes were proposed in this pull request?
Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This becomes a problem when running `EnsureRequirements`, and `gapply` in R can't work on empty grouping columns, as in the sketch below.
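A minimal sketch of the affected shape (that the empty grouping columns are passed as `c()` here is an assumption about the reproduction):
```r
df <- createDataFrame(data.frame(a = c(1L, 2L), b = c(1.5, 2.5)))
# Grouping by zero columns previously tripped over
# FlatMapGroupsInRExec.requiredChildDistribution via EnsureRequirements.
collect(gapply(df, c(), function(key, x) { x }, schema(df)))
```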
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #19436 from viirya/fix-flatmapinr-distribution.
## What changes were proposed in this pull request?
Currently, we set lintr to jimhester/lintr@a769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).
I first tested and checked lintr-1.0.1, but it looks like many important fixes are missing (for example, the 100-character line-length check). So, I instead tried the latest commit, 5431140ffe, locally and fixed the check failures.
It looks like it has fixed many bugs and now finds many instances that I have observed and thought should be caught from time to time; I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c).
The downside is that it now takes about 7 minutes locally (it was about 2 minutes before).
## How was this patch tested?
Manually, `./dev/lint-r` after manually updating the lintr package.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes #19290 from HyukjinKwon/upgrade-r-lint.
## What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric-type input and output double-type results.
Since numeric, date, and timestamp types are all represented as numerics internally, `percentile_approx` can support them easily.
After this PR, it supports date, timestamp, and numeric types as input. The result type is also changed to be the same as the input type, which is more reasonable for percentiles (see the example below).
This change is also required when we generate equi-height histograms for these types.
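For example (a sketch; after this change the result carries the input's type):
```r
df <- sql("SELECT * FROM VALUES (DATE '2017-01-01'), (DATE '2017-06-01') AS t(d)")
createOrReplaceTempView(df, "dates")
# Returns a date (the median date) rather than a double after this change.
head(sql("SELECT percentile_approx(d, 0.5) FROM dates"))
```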
## How was this patch tested?
Added a new test and modified some existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes #19321 from wzhfy/approx_percentile_support_types.
## What changes were proposed in this pull request?
This PR makes `sample(...)` able to omit `withReplacement`, which now defaults to `FALSE`.
In short, the following examples are allowed:
```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```
In addition, this PR also adds some type-checking logic, as below:
```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```
## How was this patch tested?
Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19243 from HyukjinKwon/SPARK-21780.
## What changes were proposed in this pull request?
Clarify the behavior of `to_utc_timestamp`/`from_utc_timestamp` with an example.
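The gist of the clarified behavior, as a SparkR sketch (times are illustrative):
```r
df <- createDataFrame(data.frame(t = as.POSIXct("2017-07-14 02:40:00")))
# to_utc_timestamp treats the (timezone-agnostic) input as wall-clock time
# in the given zone and renders the corresponding UTC time:
head(select(df, to_utc_timestamp(df$t, "America/Los_Angeles")))
# from_utc_timestamp does the inverse: treats the input as UTC and renders
# it as wall-clock time in the given zone.
head(select(df, from_utc_timestamp(df$t, "America/Los_Angeles")))
```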
## How was this patch tested?
Doc only change / existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes #19276 from srowen/SPARK-22049.
## What changes were proposed in this pull request?
In the previous work, SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice"}')]
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
## What changes were proposed in this pull request?
set.seed() before running tests
## How was this patch tested?
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19111 from felixcheung/rranseed.
## What changes were proposed in this pull request?
This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
**Python**
```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```
```
+----+----+----+
|col0|col1|col2|
+----+----+----+
| 1| 2| 3|
| 6| 4| 5|
+----+----+----+
```
**R**
```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```
```
carb am gear
1 4 1 4
2 4 1 4
3 4 1 4
4 4 1 4
```
## How was this patch tested?
Doctests for Python and unit test added in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19105 from HyukjinKwon/unionByName-r-python.
## What changes were proposed in this pull request?
fix the random seed to eliminate variability
## How was this patch tested?
Jenkins, AppVeyor, lots more Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19018 from felixcheung/rrftest.
## What changes were proposed in this pull request?
Code in the vignettes requires winutils to run on Windows. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from the code, but the text and the code are still there).
This fixes:
* checking re-building of vignette outputs ... WARNING
and
> %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
## How was this patch tested?
Jenkins, AppVeyor, r-hub
before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19016 from felixcheung/rvigwind.
## What changes were proposed in this pull request?
SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.
This does not break the current interface, as the present `summary` method, like `describe`, does not take additional arguments, and the output was never meant to be used programmatically.
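A usage sketch (statistic names follow the Scala `summary`):
```r
df <- createDataFrame(mtcars)
# Expanded default statistics, including quartiles
head(summary(df))
# Or select which statistics to compute
head(summary(df, "min", "25%", "75%", "max"))
```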
## How was this patch tested?
Modified and additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #18786 from aray/summary-r.
## What changes were proposed in this pull request?
Support offset in SparkR GLM (#16699).
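A usage sketch (the parameter name `offsetCol` is an assumption based on SparkR's column-parameter naming):
```r
# Hedged sketch: a Poisson GLM where log-exposure enters as an offset.
df <- createDataFrame(data.frame(counts = c(2, 7, 3, 9),
                                 x = c(1, 2, 3, 4),
                                 logExposure = log(c(1, 2, 1, 2))))
model <- spark.glm(df, counts ~ x, family = "poisson", offsetCol = "logExposure")
summary(model)
```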
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18831 from actuaryzhang/sparkROffset.
## What changes were proposed in this pull request?
SPARK-20307 added the `handleInvalid` option to `RFormula` for tree-based classification algorithms. We should add this parameter to the other classification algorithms in SparkR as well.
This is a follow-up PR for SPARK-20307.
## How was this patch tested?
New Unit tests are added.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
`RFormula` should handle invalid values in both the features and label columns. #18496 only handled invalid values in the features column. This PR adds handling of invalid values in the label column, plus test cases.
## How was this patch tested?
Add test cases.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18613 from yanboliang/spark-20307.
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes #18625 from srowen/SPARK-21267.2.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes #17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.
**Python**
`from_json`
```python
from pyspark.sql.functions import from_json
data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```
**R**
`from_json`
```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```
`structType.character`
```R
structType("a STRING, b INT")
```
`dapply`
```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```
`gapply`
```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```
## How was this patch tested?
Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18498 from HyukjinKwon/SPARK-21266.
## What changes were proposed in this pull request?
This is a retry of #18320. That PR was reverted due to unexpected test failures with a -10 error code.
I was unable to reproduce the failures on macOS, CentOS, or Ubuntu; they occurred only in Jenkins. So, tests were run to verify this, and the past try was reverted here: https://github.com/apache/spark/pull/18456
This new approach was tested in https://github.com/apache/spark/pull/18463.
**Test results**:
- With the part of suspicious change in the past try (466325d3fd)
Tests ran 4 times and 2 times passed and 2 time failed.
- Without the part of suspicious change in the past try (466325d3fd)
Tests ran 5 times and they all passed.
- With this new approach (0a7589c09f)
Tests ran 5 times and they all passed.
It looks like the cause is as below (see 466325d3fd):
```diff
+ exitCode <- 1
...
+ data <- parallel:::readChild(child)
+ if (is.raw(data)) {
+ if (unserialize(data) == exitCode) {
...
+ }
+ }
...
- parallel:::mcexit(0L)
+ parallel:::mcexit(0L, send = exitCode)
```
I think there are two possibilities:
- `parallel:::mcexit(.. , send = exitCode)`
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> It sends send to the master (unless NULL) and then shuts down the child process.
However, it looks possible that the parent attempts to terminate the child right after getting our custom exit code. So, the child gets terminated between "send" and "shuts down", failing to exit properly.
- A bug between `parallel:::mcexit(..., send = ...)` and `parallel:::readChild`.
**Proposal**:
To resolve this, I simply decided to avoid both possibilities with this new approach (9ff89a7859). To support this idea, I quote the documentation below:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> `readChild` and `readChildren` return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated or `NULL` if the child no longer exists (no children at all for `readChildren`).
`readChild` returns "an integer vector of length one with the process ID if a child terminated", so we can check whether it is an `integer` matching the selected process ID. I believe this makes sure that the children have exited.
In case children happen to send any data manually to the parent (which is why we introduced the suspicious part of the change (466325d3fd)), this should be raw bytes and will be discarded (and then we will try to read the next message and check whether it is an `integer` in the next loop iteration), roughly as sketched below.
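A hedged sketch of that loop (`child` and its `pid` field are assumed names from `parallel:::mcfork`):
```r
# Hedged sketch: poll readChild until it returns an integer, which per the
# documentation is the process ID of a terminated child; any raw data the
# child sent along the way is simply discarded.
repeat {
  ret <- parallel:::readChild(child)
  if (is.null(ret)) break                      # child no longer exists
  if (is.integer(ret) && ret == child$pid) {
    break                                      # child has terminated
  }
  # is.raw(ret): stray data from the child; ignore and keep polling
}
```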
## How was this patch tested?
Manual tests and Jenkins tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18465 from HyukjinKwon/SPARK-21093-retry-1.
## What changes were proposed in this pull request?
This adds documentation to many functions in `pyspark.sql.functions`:
`upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
Adds units to the trigonometry functions.
Renames columns in datetime examples to be more informative.
Adds links between some functions.
## How was this patch tested?
`./dev/lint-python`
`python python/pyspark/sql/functions.py`
`./python/run-tests.py --module pyspark-sql`
Author: Michael Patterson <map222@gmail.com>
Closes #17865 from map222/spark-20456.
## What changes were proposed in this pull request?
For the random forest classifier, if the test data contains unseen labels, it will throw an error. `StringIndexer` already has the `handleInvalid` logic. The patch adds a new parameter to set the underlying `StringIndexer`'s `handleInvalid` behavior.
This should also apply to other classifiers. This PR focuses on the main logic and the random forest classifier; I will do follow-up PRs for the other classifiers.
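A usage sketch (assuming the parameter surfaces as `handleInvalid` on `spark.randomForest`, with `StringIndexer`'s value names):
```r
df <- createDataFrame(iris)
# "keep" puts unseen labels in a special bucket instead of erroring out
model <- spark.randomForest(df, Species ~ ., type = "classification",
                            handleInvalid = "keep")
```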
## How was this patch tested?
Add a new unit test based on the error case in the JIRA.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18496 from wangmiao1981/handle.
## What changes were proposed in this pull request?
Add doc for methods that were left out, and fix various style and consistency issues.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18493 from actuaryzhang/sparkRDocCleanup.
## What changes were proposed in this pull request?
Grouped documentation for column window methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18481 from actuaryzhang/sparkRDocWindow.
## What changes were proposed in this pull request?
Grouped documentation for column collection methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes #18458 from actuaryzhang/sparkRDocCollection.
## What changes were proposed in this pull request?
Grouped documentation for column misc methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes #18448 from actuaryzhang/sparkRDocMisc.
## What changes were proposed in this pull request?
Grouped documentation for nonaggregate column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes #18422 from actuaryzhang/sparkRDocNonAgg.
## What changes were proposed in this pull request?
This PR proposes to support a DDL-formatted string as the schema, as below:
```r
mockLines <- c("{\"name\":\"Michael\"}",
"{\"name\":\"Andy\", \"age\":30}",
"{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```
## How was this patch tested?
Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18431 from HyukjinKwon/r-ddl-schema.
## What changes were proposed in this pull request?
Grouped documentation for string column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes #18366 from actuaryzhang/sparkRDocString.
## What changes were proposed in this pull request?
Grouped documentation for math column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes #18371 from actuaryzhang/sparkRDocMath.
## What changes were proposed in this pull request?
`mcfork` in R appears to open a pipe ahead of time, but the existing logic does not properly close it when executed hot (repeatedly). This leads to failures of further forking due to the limit on the number of open files.
This hot path is hit particularly by `gapply`/`gapplyCollect`. For unknown reasons, this happens more easily on CentOS and could be reproduced on Mac too.
All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
## How was this patch tested?
I ran the codes below on both CentOS and Mac with that configuration disabled/enabled.
```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
... # 30 times
```
Also, now it passes R tests on CentOS as below:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18320 from HyukjinKwon/SPARK-21093.
## What changes were proposed in this pull request?
Extend `setJobDescription` to SparkR API.
## How was this patch tested?
It looks difficult to add a test. Manually tested as below:
```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```
prints ...
![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18382 from HyukjinKwon/SPARK-21149.
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18114 from actuaryzhang/sparkRDocDate.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/17715 added Constrained Logistic Regression for ML. We should add it to SparkR.
## How was this patch tested?
Add new unit tests.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18128 from wangmiao1981/test.
## What changes were proposed in this pull request?
Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with R's default, as in the sketch below.
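A usage sketch (assuming `alphabetDesc` is the ordering that matches base R's reference-level choice):
```r
df <- createDataFrame(iris)   # note: "." in column names becomes "_"
model <- spark.glm(df, Sepal_Width ~ Species, family = "gaussian",
                   stringIndexerOrderType = "alphabetDesc")
summary(model)
```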
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18140 from actuaryzhang/sparkRFormula.
## What changes were proposed in this pull request?
LinearSVC should use its own `threshold` param rather than the shared one, since it applies to `rawPrediction` instead of probability. This PR changes the param in the Scala, Python, and R APIs.
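On the R side this surfaces roughly as below (a sketch, assuming the wrapper exposes `threshold`; since the threshold applies to `rawPrediction`, any double, including negative values, is legal):
```r
training <- createDataFrame(data.frame(label = c(0, 0, 1, 1),
                                       f = c(0.1, 0.3, 2.5, 3.0)))
# threshold compares against rawPrediction, not a probability in [0, 1]
model <- spark.svmLinear(training, label ~ f, regParam = 0.01, threshold = -0.2)
```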
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18025 from actuaryzhang/sparkRDoc4.
## What changes were proposed in this pull request?
Add the SQL `trunc` function to SparkR.
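A usage sketch of the new function:
```r
df <- createDataFrame(data.frame(d = as.Date(c("2017-06-14", "2017-11-20"))))
# Truncate each date to the first day of its year / month
head(select(df, trunc(df$d, "year"), trunc(df$d, "month")))
```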
## How was this patch tested?
standard test
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18291 from actuaryzhang/sparkRTrunc2.
## What changes were proposed in this pull request?
This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db", so that the next run of the R tests passes fine. This is sometimes a bit annoying; a sketch of the ordering follows.
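A rough sketch of the ordering (directory and variable names are assumptions):
```r
# Hedged sketch: delete the directories Spark leaves behind *before*
# taking the file listing that the test later compares against.
sparkRDir <- getwd()  # assumption: the directory the test snapshots
unlink(file.path(sparkRDir, "spark-warehouse"), recursive = TRUE)
unlink(file.path(sparkRDir, "metastore_db"), recursive = TRUE)
filesBefore <- list.files(sparkRDir)
```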
## How was this patch tested?
Manually running multiple times R tests via `./R/run-tests.sh`.
**Before**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................
Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
DONE ===========================================================================
```
**After**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #18335 from HyukjinKwon/SPARK-21128.
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file; in fact, one file can contain multiple records. Thus, we should rename it. The proposal is `multiLine`.
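In SparkR terms, the renamed option reads roughly as (a sketch):
```r
# Hedged sketch: parse CSV records that span multiple physical lines.
df <- read.df("cars.csv", source = "csv", header = "true", multiLine = "true")
```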
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes #18202 from gatorsmile/renameCVSOption.
## What changes were proposed in this pull request?
clean up after big test move
## How was this patch tested?
unit tests, Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #18267 from felixcheung/rtestset2.
## What changes were proposed in this pull request?
Move all existing tests to a non-installed directory so that they will never run when installing the SparkR package.
For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change the basic tests that do run on CRAN (if anyone has suggestions)
It looks like `R CMD build pkg` will still put pkg/tests (i.e. the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
## How was this patch tested?
- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #18264 from felixcheung/rtestset.