## What changes were proposed in this pull request?
Add array_remove / array_zip / map_from_arrays / array_distinct functions in SparkR.
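For readers unfamiliar with these collection functions, here is a rough plain-Python sketch of what each one computes. It is not the SparkR or JVM implementation: the function names simply mirror the ones listed above, the zip helper assumes equal-length arrays, and keeping first occurrences in order for the distinct helper is an assumption of this sketch.

```python
def array_remove(arr, element):
    """Drop every item equal to `element`."""
    return [x for x in arr if x != element]


def array_distinct(arr):
    """Remove duplicates; keeping first occurrences in order is an assumption."""
    seen, out = set(), []
    for x in arr:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def array_zip(*arrays):
    """Pair up the i-th items of each array (equal lengths assumed)."""
    return list(zip(*arrays))


def map_from_arrays(keys, values):
    """Build a map from an array of keys and an array of values."""
    return dict(zip(keys, values))
```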
## How was this patch tested?
Add tests in test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21645 from huaxingao/spark-24537.
## What changes were proposed in this pull request?
Add model predictions for Linear Support Vector Machine (SVM) Classifier, Logistic Regression, GBT, RF and DecisionTree in vignettes.
## How was this patch tested?
Manually ran the test and checked the result.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21678 from huaxingao/spark-23461.
## What changes were proposed in this pull request?
change to skip tests if
- couldn't determine java version
fix problem on windows
## How was this patch tested?
unit test, manual, win-builder
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #21666 from felixcheung/rjavaskip.
## What changes were proposed in this pull request?
This PR adds array_join function to SparkR
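As an illustration only, the semantics of joining array elements can be sketched in plain Python. The `null_replacement` parameter name is an assumption of this sketch; null elements are skipped when no replacement is given.

```python
def array_join(arr, delimiter, null_replacement=None):
    """Concatenate array items with `delimiter`.

    None items are dropped unless a replacement string is supplied.
    """
    if null_replacement is None:
        items = [x for x in arr if x is not None]
    else:
        items = [null_replacement if x is None else x for x in arr]
    return delimiter.join(str(x) for x in items)
```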
## How was this patch tested?
Add unit test in test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21313 from huaxingao/spark-24187.
## What changes were proposed in this pull request?
change generic to get it to work with googleVis
also fix lintr
## How was this patch tested?
manual test, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #21315 from felixcheung/googvis.
## What changes were proposed in this pull request?
Change text to grep for.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #21314 from felixcheung/openjdkver.
## What changes were proposed in this pull request?
reverse and concat are already in functions.R as column string functions. Since these two functions are now categorized as collection functions in Scala and Python, we will do the same in R.
## How was this patch tested?
Add test in test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21307 from huaxingao/spark_24186.
## What changes were proposed in this pull request?
The PR adds the `slice` function to SparkR. The function returns a subset of consecutive elements from the given array.
```
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
> head(select(tmp, slice(tmp$v1, 2L, 2L)))
```
```
slice(v1, 2, 2)
1 6, 110
2 6, 110
3 4, 93
4 6, 110
5 8, 175
6 6, 105
```
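The indexing in the example above is 1-based: `slice(v1, 2L, 2L)` starts at the second element and takes two elements. A minimal plain-Python sketch of that behavior, supporting positive start positions only (a simplification of this sketch, not a statement about the full function):

```python
def slice_array(arr, start, length):
    """Return `length` consecutive items beginning at 1-based position `start`."""
    if start < 1:
        raise ValueError("this sketch only supports 1-based positive start positions")
    if length < 0:
        raise ValueError("length must not be negative")
    begin = start - 1  # convert the 1-based position to a 0-based index
    return arr[begin:begin + length]
```

With the first `mtcars` row (`mpg = 21.0`, `cyl = 6`, `hp = 110`), `slice_array([21.0, 6, 110], 2, 2)` gives `[6, 110]`, matching the first output row.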
## How was this patch tested?
A test added into R/pkg/tests/fulltests/test_sparkSQL.R
Author: Marek Novotny <mn.mikke@gmail.com>
Closes #21298 from mn-mikke/SPARK-24198.
This change updates the SystemRequirements and also includes a runtime check if the JVM is being launched by R. The runtime check is done by querying `java -version`
## How was this patch tested?
Tested on a Mac and Windows machine
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #21278 from shivaram/sparkr-skip-solaris.
## What changes were proposed in this pull request?
It's useful to know what relationship between date1 and date2 results in a positive number.
Author: aditkumar <aditkumar@gmail.com>
Author: Adit Kumar <aditkumar@gmail.com>
Closes #20787 from aditkumar/master.
## What changes were proposed in this pull request?
The PR adds array_sort function to SparkR.
## How was this patch tested?
Tests added into R/pkg/tests/fulltests/test_sparkSQL.R
## Example
```
> df <- createDataFrame(list(list(list(2L, 1L, 3L, NA)), list(list(NA, 6L, 5L, NA, 4L))))
> head(collect(select(df, array_sort(df[[1]]))))
```
Result:
```
array_sort(_1)
1 1, 2, 3, NA
2 4, 5, 6, NA, NA
```
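The result shows an ascending sort with `NA` values placed last. A plain-Python sketch of that ordering, using `None` in place of `NA` (illustrative only, not the SparkR implementation):

```python
def array_sort(arr):
    """Sort ascending and move None (NA) items to the end."""
    present = sorted(x for x in arr if x is not None)
    return present + [None] * (len(arr) - len(present))
```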
Author: Marek Novotny <mn.mikke@gmail.com>
Closes #21294 from mn-mikke/SPARK-24197.
## What changes were proposed in this pull request?
I propose to add a clear statement to functions like `collect_list()` about their non-deterministic behavior. Users must take this behavior into account when creating and running queries.
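A toy plain-Python model of why the result can vary: the collected list follows the order in which rows arrive, and that order can change between runs (for example after a shuffle). The partition contents here are made up for illustration.

```python
from collections import defaultdict


def collect_list(rows):
    """Group (key, value) rows, appending values in arrival order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


# Two hypothetical partitions holding rows for the same key.
partition_a = [("k", 1), ("k", 2)]
partition_b = [("k", 3)]

# The same data, consumed in two different partition orders.
run_1 = collect_list(partition_a + partition_b)
run_2 = collect_list(partition_b + partition_a)
```

Here `run_1["k"]` and `run_2["k"]` hold the same elements in different orders, so any query that depends on the order should sort explicitly.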
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes #21228 from MaxGekk/deterministic-comments.
## What changes were proposed in this pull request?
Mention `spark.sql.crossJoin.enabled` in error message when an implicit `CROSS JOIN` is detected.
## How was this patch tested?
`CartesianProductSuite` and `JoinSuite`.
Author: Henry Robinson <henry@apache.org>
Closes #21201 from henryr/spark-24128.
## What changes were proposed in this pull request?
add array flatten function to SparkR
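For illustration, flattening an array of arrays by one level can be sketched in plain Python (not the SparkR implementation):

```python
def flatten(nested):
    """Concatenate the inner arrays of an array of arrays (one level only)."""
    return [item for inner in nested for item in inner]
```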
## How was this patch tested?
Unit tests were added in R/pkg/tests/fulltests/test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21244 from huaxingao/spark-24185.
## What changes were proposed in this pull request?
The lint failure bugged me:
```R
R/SQLContext.R:715:97: style: Trailing whitespace is superfluous.
#' file-based streaming data source. \code{timeZone} to indicate a timezone to be used to
^
tests/fulltests/test_streaming.R:239:45: style: Commas should always have a space after.
expect_equal(times[order(times$eventTime),][1, 2], 2)
^
lintr checks failed.
```
and I actually saw https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/500/console too. If I understood correctly, there is an attempt to move to the Ubuntu one.
## How was this patch tested?
Manually tested by `./dev/lint-r`:
```
...
lintr checks passed.
```
Author: hyukjinkwon <gurwls223@apache.org>
Closes #20879 from HyukjinKwon/minor-r-lint.
## What changes were proposed in this pull request?
It seems R's substr API treats the Scala substr API as zero-based and so subtracts 1 from the given starting position.
Because Scala's substr API also accepts a zero-based starting position (treated as the first element), the current R substr test results are correct, as they all use 1 as the starting position.
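A small plain-Python model of the off-by-one makes this concrete. `column_substr` models the JVM-side behavior as described above (1-based, with 0 treated like 1). Both wrapper names are hypothetical, and the `stop - start + 1` length computation is an assumption of this sketch.

```python
def column_substr(s, pos, length):
    """Model of the JVM-side substr: 1-based, and position 0 behaves like 1."""
    begin = max(pos, 1) - 1
    return s[begin:begin + length]


def r_substr_old(s, start, stop):
    """Old wrapper: subtracts 1 from the start before delegating."""
    return column_substr(s, start - 1, stop - start + 1)


def r_substr_fixed(s, start, stop):
    """Fixed wrapper: passes the 1-based start through unchanged."""
    return column_substr(s, start, stop - start + 1)
```

With `start = 1` both wrappers agree, which is why the existing tests passed. With `start = 2` the old wrapper still returns the prefix of the string.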
## How was this patch tested?
Modified tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20464 from viirya/SPARK-23291.
## What changes were proposed in this pull request?
Removed export tag to get rid of unknown tag warnings
## How was this patch tested?
Existing tests
Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: rjoshi2 <rekhajoshm@gmail.com>
Closes #20501 from rekhajoshm/SPARK-22430.
## What changes were proposed in this pull request?
Provide more details in trigonometric function documentations. Referenced `java.lang.Math` for further details in the descriptions.
## How was this patch tested?
Ran full build, checked generated documentation manually
Author: Mihaly Toth <misutoth@gmail.com>
Closes #20618 from misutoth/trigonometric-doc.
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18944 added a patch that allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that session. This causes trouble for users who only want to read / write data frames without a metastore setup.
## How was this patch tested?
Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Feng Liu <fengliu@databricks.com>
Closes #20681 from liufengdb/completely-lazy.
## What changes were proposed in this pull request?
Fix doc link that was changed in 2.3
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20711 from felixcheung/rvigmean.
## What changes were proposed in this pull request?
Update the description and tests of three external APIs or functions: `createFunction`, `length` and `repartitionByRange`.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20495 from gatorsmile/updateFunc.
## What changes were proposed in this pull request?
It's not obvious from the comments that any added column must be a
function of the dataset that we are adding it to. Add a comment to
that effect to Scala, Python and R Data* methods.
Author: Henry Robinson <henry@cloudera.com>
Closes #20429 from henryr/SPARK-23157.
## What changes were proposed in this pull request?
doc only changes
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20380 from felixcheung/rclrdoc.
## What changes were proposed in this pull request?
A fix to https://issues.apache.org/jira/browse/SPARK-21727, "Operating on an ArrayType in a SparkR DataFrame throws error"
## How was this patch tested?
- Ran tests at R\pkg\tests\run-all.R (see below attached results)
- Tested the following lines in SparkR, which now seem to execute without error:
```
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
mySparkDf <- as.DataFrame(myDf)
collect(mySparkDf)
```
[2018-01-22 SPARK-21727 Test Results.txt](https://github.com/apache/spark/files/1653535/2018-01-22.SPARK-21727.Test.Results.txt)
felixcheung yanboliang sun-rui shivaram
_The contribution is my original work and I license the work to the project under the project’s open source license_
Author: neilalex <neil@neilalex.com>
Closes #20352 from neilalex/neilalex-sparkr-arraytype.
## What changes were proposed in this pull request?
Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more
explicit in the documentation, and call out the change in behavior
from 1.x.
Author: Henry Robinson <henry@cloudera.com>
Closes #20254 from henryr/spark-23062.
## What changes were proposed in this pull request?
RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes #20229 from MrBago/rFormula.
## What changes were proposed in this pull request?
fix doc truncated
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20263 from felixcheung/r23docfix.
## What changes were proposed in this pull request?
This patch bumps the master branch version to `2.4.0-SNAPSHOT`.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20222 from gatorsmile/bump24.
## What changes were proposed in this pull request?
Including VectorSizeHint in RFormula pipelines will allow them to be applied to streaming dataframes.
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes #20238 from MrBago/rFormulaVectorSize.
## What changes were proposed in this pull request?
Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context.
## How was this patch tested?
No tests necessary, just a doc update.
Author: sethah <shendrickson@cloudera.com>
Closes #20188 from sethah/als_checkpoint_doc.
## What changes were proposed in this pull request?
R Structured Streaming API for withWatermark, trigger, partitionBy
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20129 from felixcheung/rwater.
## What changes were proposed in this pull request?
update R migration guide and vignettes
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20106 from felixcheung/rreleasenote23.
## What changes were proposed in this pull request?
Add to `arrange` the option to sort only within partition
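The difference between a global sort and a sort within each partition can be pictured with a plain-Python toy model, where a dataset is a list of partitions. The helper names are hypothetical and this is not how `arrange` is implemented.

```python
def sort_within_partitions(partitions, key=None):
    """Sort each partition independently; rows never move between partitions."""
    return [sorted(p, key=key) for p in partitions]


def sort_globally(partitions, key=None):
    """Sort all rows together, which requires looking across partitions."""
    return sorted((row for p in partitions for row in p), key=key)
```

For `[[3, 1], [2, 0]]`, the first gives `[[1, 3], [0, 2]]` and the second gives `[0, 1, 2, 3]`.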
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20118 from felixcheung/rsortwithinpartition.
## What changes were proposed in this pull request?
This PR modifies `concat` to concatenate binary inputs into a single binary output.
`concat` in the current master always outputs data as a string. But in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary.
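The intended rule can be sketched in plain Python, with `bytes` standing in for binary and `str` for string. Decoding mixed inputs as UTF-8 is an assumption of this sketch, not something stated above.

```python
def concat(*inputs):
    """Return bytes when every input is bytes; otherwise return a string."""
    if inputs and all(isinstance(x, bytes) for x in inputs):
        return b"".join(inputs)
    # Mixed or string inputs: fall back to string output (UTF-8 assumed).
    return "".join(x.decode("utf-8") if isinstance(x, bytes) else str(x) for x in inputs)
```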
## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #19977 from maropu/SPARK-22771.
## What changes were proposed in this pull request?
Add sql functions
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20105 from felixcheung/rsqlfuncs.
## What changes were proposed in this pull request?
This PR proposes to add `localCheckpoint(..)` in R API.
```r
df <- localCheckpoint(createDataFrame(iris))
```
## How was this patch tested?
Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20073 from HyukjinKwon/SPARK-22843.
## What changes were proposed in this pull request?
Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins as `SPARK_HOME` is set when running Jenkins tests.
## How was this patch tested?
Test manually by running `R CMD check --as-cran`
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #20060 from shivaram/sparkr-overwrite-cran.
## What changes were proposed in this pull request?
This PR adds `date_trunc` in R API as below:
```r
> df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
> head(select(df, date_trunc("hour", df$a)))
date_trunc(hour, a)
1 2012-12-13 12:00:00
```
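Truncation keeps the fields down to the requested unit and resets everything finer. A plain-Python sketch with `datetime`, where the set of supported unit names is an assumption of this sketch:

```python
from datetime import datetime

_FIELDS = ["year", "month", "day", "hour", "minute", "second"]
_FLOOR = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def date_trunc(unit, ts):
    """Reset every field finer than `unit` to its smallest value."""
    keep = _FIELDS.index(unit) + 1  # raises ValueError for unknown units
    reset = {field: _FLOOR[field] for field in _FIELDS[keep:]}
    return ts.replace(microsecond=0, **reset)
```

`date_trunc("hour", datetime(2012, 12, 13, 12, 34))` gives `2012-12-13 12:00:00`, matching the output above.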
## How was this patch tested?
Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20031 from HyukjinKwon/r-datetrunc.
## What changes were proposed in this pull request?
This is a followup to reduce AppVeyor test time. This PR proposes to reduce the number of shuffle partitions, and thereby the number of tasks running R workers, in a few particular tests.
The symptom is similar to the one described in `https://github.com/apache/spark/pull/19722`. Many R processes are newly launched on Windows without forking, which accounts for the difference in elapsed time between Linux and Windows.
Here is a simple before/after comparison of this change. I manually tested this by disabling `spark.sparkr.use.daemon`. Disabling it resembles the tests on Windows:
**Before**
<img width="672" alt="2017-11-25 12 22 13" src="https://user-images.githubusercontent.com/6477701/33217949-b5528dfa-d17d-11e7-8050-75675c39eb20.png">
**After**
<img width="682" alt="2017-11-25 12 32 00" src="https://user-images.githubusercontent.com/6477701/33217958-c6518052-d17d-11e7-9f8e-1be21a784559.png">
So, this will probably reduce the test time by roughly 10 minutes or more.
## How was this patch tested?
AppVeyor tests
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19816 from HyukjinKwon/SPARK-21693-followup.
## What changes were proposed in this pull request?
This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.
The root cause appears to be that it triggers roughly 2500 jobs with the default 100 max iterations. On Linux, `daemon.R` is forked, but on Windows another process is launched, which is extremely slow.
So, given my observation, many (non-forked) processes are run on Windows, which accounts for the difference in elapsed time.
After reducing the max iterations to 10, the total number of jobs in this single test is reduced to roughly 550.
After reducing the max iterations to 5, the total number of jobs in this single test is reduced to roughly 360.
## How was this patch tested?
Manually tested the elapsed times.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19722 from HyukjinKwon/SPARK-21693-test.
## What changes were proposed in this pull request?
The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls the `sessionState.catalog.lookupRelation` API. This skips the view resolution logic in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands and by public and internal APIs.
Users might get a strange error caused by view resolution when the default database is different.
```
Table or view not found: t1; line 1 pos 14
org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```
This PR fixes it by forcing the API to use `ResolveRelations` to resolve the table.
## How was this patch tested?
Added a test case and modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes #19713 from gatorsmile/viewResolution.
## What changes were proposed in this pull request?
This PR adds `dayofweek` to R API:
```r
data <- list(list(d = as.Date("2012-12-13")),
list(d = as.Date("2013-12-14")),
list(d = as.Date("2014-12-15")))
df <- createDataFrame(data)
collect(select(df, dayofweek(df$d)))
```
```
dayofweek(d)
1 5
2 7
3 2
```
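The numbering in the output runs from Sunday = 1 to Saturday = 7. A plain-Python sketch that reproduces it from `date.isoweekday()` (Monday = 1 to Sunday = 7), for illustration only:

```python
from datetime import date


def dayofweek(d):
    """Map a date to 1 (Sunday) through 7 (Saturday)."""
    return d.isoweekday() % 7 + 1
```

For the three dates above (a Thursday, a Saturday and a Monday) this gives 5, 7 and 2.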
## How was this patch tested?
Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19706 from HyukjinKwon/add-dayofweek.
## What changes were proposed in this pull request?
remove spark if spark downloaded & installed
## How was this patch tested?
manually by building package
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19657 from felixcheung/rinstalldir.
## What changes were proposed in this pull request?
This PR proposes to add `errorifexists` to the SparkR API and to fix the remaining descriptions of the mode, mainly in the API documentation.
This PR also replaces `convertToJSaveMode` with `setWriteMode` so that the string is passed as is to the JVM and executes:
b034f2565f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (L72-L82)
and removes the duplication here:
3f958a9992/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L187-L194)
## How was this patch tested?
Manually checked the built documentation. These were mainly found by `` grep -r `error` `` and `grep -r 'error'`.
Also, unit tests added in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19673 from HyukjinKwon/SPARK-21640-followup.
## What changes were proposed in this pull request?
This is to fix the code for the latest R changes in R-devel, when running CRAN check
```
checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
Code: function(what, pos = 2L, name = deparse(substitute(what),
backtick = FALSE), warn.conflicts = TRUE)
Docs: function(what, pos = 2L, name = deparse(substitute(what)),
warn.conflicts = TRUE)
Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))
Codoc mismatches from documentation object 'glm':
glm
Code: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
NULL, ...)
Docs: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
Argument names in code not in docs:
singular.ok
Mismatches in argument names:
Position: 16 Code: singular.ok Docs: contrasts
Position: 17 Code: contrasts Docs: ...
```
With attach, it's pulling in the function definition from base::attach. We need to disable that but we would still need a function signature for roxygen2 to build with.
With glm it's pulling in the function definition (i.e. "usage") from the stats::glm function. Since this is "compiled in" to the .Rd file when we build the source package, when the definition changes at runtime or in CRAN check it won't match the latest signature. The solution is not to pull in from stats::glm, since there isn't much value in doing that (we use almost none of those params, and the ones we do use we have documented explicitly).
Also, with attach we are changing to call it dynamically.
## How was this patch tested?
Manually.
- [x] check documentation output - yes
- [x] check help `?attach` `?glm` - yes
- [x] check on other platforms, r-hub, on r-devel etc..
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19557 from felixcheung/rattachglmdocerror.
## What changes were proposed in this pull request?
This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have an R package downloaded from CRAN and are using that to connect to an existing Spark cluster.
This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.)
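The warn-instead-of-fail choice can be sketched in plain Python. The function name and message text are hypothetical. The point is only that a mismatch is reported while execution continues.

```python
import warnings


def check_versions(package_version, jvm_version):
    """Warn (do not raise) when the two version strings differ.

    Returns True when they match, False otherwise.
    """
    if package_version != jvm_version:
        warnings.warn(
            f"Version mismatch between package ({package_version}) "
            f"and JVM ({jvm_version})"
        )
        return False
    return True
```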
## How was this patch tested?
Manually by changing the `DESCRIPTION` file
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #19624 from shivaram/sparkr-version-check.
## What changes were proposed in this pull request?
Will need to port this to branch-1.6, -2.0, -2.1, -2.2
## How was this patch tested?
manually
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19549 from felixcheung/rcranversioncheck.
This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`
## How was this patch tested?
Tested manually on a clean EC2 machine
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #19589 from shivaram/sparkr-tmpdir-clean.
## What changes were proposed in this pull request?
This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138cd0e.
Simply, it casts `character` to `factor` if it meets the condition `stringsAsFactors && is.character(vec)` in primitive type conversion.
## How was this patch tested?
Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19551 from HyukjinKwon/SPARK-17902.
## What changes were proposed in this pull request?
Currently percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
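The expected answers can be checked against an exact nearest-rank computation in plain Python. This is only a reference model of the desired result, not the approximate algorithm from the paper, and the function name is made up for illustration.

```python
import math


def nearest_rank_percentile(sorted_values, p):
    """Exact nearest-rank percentile: the smallest value covering fraction p."""
    n = len(sorted_values)
    rank = max(math.ceil(p * n), 1)  # percentiles in [0, 1/N] map to rank 1
    return sorted_values[rank - 1]
```

For the values 1 to 10, a 10% percentile (or anything smaller) yields 1, as described above.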
## How was this patch tested?
Added a new test case and fixed existing test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #19438 from wzhfy/improve_percentile_approx.
## What changes were proposed in this pull request?
It looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This is a problem when running `EnsureRequirements`, and `gapply` in R can't work on empty grouping columns.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #19436 from viirya/fix-flatmapinr-distribution.
## What changes were proposed in this pull request?
When zinc is running, the pwd might be the root of the project. A quick solution is to not go a level up in case we are in the root rather than root/core/. If we are in the root everything works fine; if we are in core, add a script which goes up a level and runs.
## How was this patch tested?
set -x in the SparkR install scripts.
Author: Holden Karau <holden@us.ibm.com>
Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.
## What changes were proposed in this pull request?
Currently, we set lintr to jimhester/lintr@a769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).
I first tested and checked lintr-1.0.1, but it looks like many important fixes are missing (for example, checking the 100-character line length). So, I instead tried the latest commit, 5431140ffe, locally and fixed the check failures.
It looks like it has fixed many bugs and now finds many instances that I had noticed from time to time and thought should be caught. I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c) here.
The downside is that it now takes about 7 minutes locally (it was about 2 minutes before).
## How was this patch tested?
Manually, `./dev/lint-r` after manually updating the lintr package.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes #19290 from HyukjinKwon/upgrade-r-lint.
## What changes were proposed in this pull request?
The `percentile_approx` function previously accepted numeric type input and output double type results.
But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.
After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
This change is also required when we generate equi-height histograms for these types.
## How was this patch tested?
Added a new test and modified some existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes #19321 from wzhfy/approx_percentile_support_types.
## What changes were proposed in this pull request?
This PR makes it possible to omit `withReplacement` in `sample(...)`, defaulting it to `FALSE`.
In short, the following examples are allowed:
```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, fraction=1.0))
[1] 10
```
In addition, this PR adds some type checking logic, as below:
```r
> sample(df, fraction = "a")
Error in sample(df, fraction = "a") :
fraction must be numeric; however, got character
> sample(df, fraction = 1, seed = NULL)
Error in sample(df, fraction = 1, seed = NULL) :
seed must not be NULL or NA; however, got NULL
> sample(df, list(1), 1.0)
Error in sample(df, list(1), 1) :
withReplacement must be logical; however, got list
> sample(df, fraction = -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```
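The shape of these checks (validate argument types up front and report the offending type) can be sketched in plain Python. The function name is hypothetical, the messages are modeled on the ones shown above, and the `seed` check is left out for brevity.

```python
import numbers


def check_sample_args(with_replacement, fraction):
    """Raise early on bad arguments, before any sampling work would start."""
    if not isinstance(with_replacement, bool):
        raise TypeError(
            "withReplacement must be logical; however, got "
            + type(with_replacement).__name__
        )
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(fraction, bool) or not isinstance(fraction, numbers.Real):
        raise TypeError(
            "fraction must be numeric; however, got " + type(fraction).__name__
        )
    if not with_replacement and not 0.0 <= fraction <= 1.0:
        raise ValueError(
            f"Sampling fraction ({fraction}) must be on interval [0, 1] "
            "without replacement"
        )
```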
## How was this patch tested?
Manually tested, unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19243 from HyukjinKwon/SPARK-21780.
## What changes were proposed in this pull request?
Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example
## How was this patch tested?
Doc only change / existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes #19276 from srowen/SPARK-22049.
## What changes were proposed in this pull request?
In previous work, SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make SparkSQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
### For PySpark
```
>>> data = [(1, {"name": "Alice"})]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'{"name":"Alice"}')]
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
```
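The target string format (compact JSON with no spaces after separators) can be reproduced with the standard library. This is a sketch of the expected output only, not of how `to_json` is implemented in Spark.

```python
import json


def to_json(value):
    """Serialize a dict (map) or a list of dicts (array of maps) compactly."""
    return json.dumps(value, separators=(",", ":"))
```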
### For SparkR
```
# Converts a map into a JSON object
df2 <- sql("SELECT map('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
# Converts an array of maps into a JSON array
df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
```
## How was this patch tested?
Add unit test cases.
cc viirya HyukjinKwon
Author: goldmedal <liugs963@gmail.com>
Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
## What changes were proposed in this pull request?
set.seed() before running tests
## How was this patch tested?
jenkins, appveyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19111 from felixcheung/rranseed.
## What changes were proposed in this pull request?
This PR proposes to add a wrapper for `unionByName` API to R and Python as well.
**Python**
```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```
```
+----+----+----+
|col0|col1|col2|
+----+----+----+
| 1| 2| 3|
| 6| 4| 5|
+----+----+----+
```
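The second row comes out as `6, 4, 5` because columns are matched by name rather than by position. A plain-Python sketch of that resolution, using column-name lists and row tuples (the helper is hypothetical and not the Spark implementation):

```python
def union_by_name(cols1, rows1, cols2, rows2):
    """Append rows2 to rows1 after reordering them to match cols1 by name."""
    if sorted(cols1) != sorted(cols2):
        raise ValueError("both inputs must have the same set of column names")
    positions = [cols2.index(c) for c in cols1]
    reordered = [tuple(row[i] for i in positions) for row in rows2]
    return cols1, rows1 + reordered
```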
**R**
```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```
```
carb am gear
1 4 1 4
2 4 1 4
3 4 1 4
4 4 1 4
```
## How was this patch tested?
Doctests for Python and unit test added in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #19105 from HyukjinKwon/unionByName-r-python.
## What changes were proposed in this pull request?
fix the random seed to eliminate variability
## How was this patch tested?
jenkins, appveyor, lots more jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19018 from felixcheung/rrftest.
## What changes were proposed in this pull request?
Code in vignettes requires winutils on Windows to run. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from the code, but the text and the code are still there).
fix * checking re-building of vignette outputs ... WARNING
and
> %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
## How was this patch tested?
jenkins, appveyor, r-hub
before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #19016 from felixcheung/rvigwind.
## What changes were proposed in this pull request?
SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute.
This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.
## How was this patch tested?
Modified and additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #18786 from aray/summary-r.
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #18831 from actuaryzhang/sparkROffset.
## What changes were proposed in this pull request?
SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.
This is a followup PR for SPARK-20307.
## How was this patch tested?
New Unit tests are added.
Author: wangmiao1981 <wm624@hotmail.com>
Closes #18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
`RFormula` should handle invalid values for both the features and label columns.
#18496 only handles invalid values in the features column. This PR adds handling of invalid values for the label column, along with test cases.
## How was this patch tested?
Add test cases.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18613 from yanboliang/spark-20307.
## What changes were proposed in this pull request?
Update internal references from programming-guide to rdd-programming-guide
See 5ddf243fd8 and https://github.com/apache/spark/pull/18485#issuecomment-314789751
Let's keep the redirector even if it's problematic to build, but not rely on it internally.
## How was this patch tested?
(Doc build)
Author: Sean Owen <sowen@cloudera.com>
Closes#18625 from srowen/SPARK-21267.2.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
Additionally, this PR exposes `structType` in R to allow workarounds in other possible corner cases.
**Python**
`from_json`
```python
from pyspark.sql.functions import from_json
data = [(1, '''{"a": 1}''')]
df = spark.createDataFrame(data, ("key", "value"))
df.select(from_json(df.value, "a INT").alias("json")).show()
```
**R**
`from_json`
```R
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, "name STRING")))
```
`structType.character`
```R
structType("a STRING, b INT")
```
`dapply`
```R
dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
```
`gapply`
```R
gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
```
## How was this patch tested?
Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18498 from HyukjinKwon/SPARK-21266.
## What changes were proposed in this pull request?
This is a retry for #18320. That PR was reverted due to unexpected test failures with a -10 error code.
I was unable to reproduce the failure on macOS, CentOS or Ubuntu; it occurred only in Jenkins. So, tests were run to verify this and to revert the past try here - https://github.com/apache/spark/pull/18456
This new approach was tested in https://github.com/apache/spark/pull/18463.
**Test results**:
- With the part of suspicious change in the past try (466325d3fd)
Tests ran 4 times: 2 passed and 2 failed.
- Without the part of suspicious change in the past try (466325d3fd)
Tests ran 5 times and they all passed.
- With this new approach (0a7589c09f)
Tests ran 5 times and they all passed.
It looks like the cause is as below (see 466325d3fd):
```diff
+ exitCode <- 1
...
+ data <- parallel:::readChild(child)
+ if (is.raw(data)) {
+ if (unserialize(data) == exitCode) {
...
+ }
+ }
...
- parallel:::mcexit(0L)
+ parallel:::mcexit(0L, send = exitCode)
```
I think there are two possibilities:
- `parallel:::mcexit(.. , send = exitCode)`
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> It sends send to the master (unless NULL) and then shuts down the child process.
However, it looks possible that the parent attempts to terminate the child right after getting our custom exit code. So, the child gets terminated between "send" and "shuts down", failing to exit properly.
- A bug between `parallel:::mcexit(..., send = ...)` and `parallel:::readChild`.
**Proposal**:
To resolve this, I simply decided to avoid both possibilities with the new approach here (9ff89a7859). To support this idea, here is a quotation from the documentation:
https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
> `readChild` and `readChildren` return a raw vector with a "pid" attribute if data were available, an integer vector of length one with the process ID if a child terminated or `NULL` if the child no longer exists (no children at all for `readChildren`).
`readChild` returns "an integer vector of length one with the process ID if a child terminated", so we can check whether the value is an `integer` and equal to the selected process ID. I believe this ensures that the children have exited.
If children happen to send any data manually to the parent (which is why we introduced the suspicious part of the change (466325d3fd)), that data will be raw bytes and will be discarded; the next loop iteration then reads again and checks whether the value is an `integer`.
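The check described above can be sketched as a plain-Python simulation. Nothing here calls R's `parallel` package; `classify_read` and `wait_for_exit` are hypothetical names, and treating a missing child (`None`, standing in for `NULL`) as "already gone" is an assumption of the sketch rather than something stated in this PR.

```python
# Simulation only: values mimic the three documented return shapes of readChild.

def classify_read(value, pid):
    """Classify one simulated read from a child process."""
    if value is None:
        return "gone"            # child no longer exists
    if isinstance(value, (bytes, bytearray)):
        return "data"            # raw payload sent by the child: discard it
    if isinstance(value, int) and value == pid:
        return "terminated"      # integer process ID: the child has exited
    return "other"


def wait_for_exit(reads, pid):
    """Consume successive reads until the child is known to have exited."""
    discarded = 0
    for value in reads:
        kind = classify_read(value, pid)
        if kind in ("terminated", "gone"):
            return kind, discarded
        discarded += 1           # raw data (or anything else) is skipped
    return "still-running", discarded


# Two raw payloads arrive before the exit notification for pid 4242.
result = wait_for_exit([b"\x00\x01", b"payload", 4242], pid=4242)
print(result)
```

The point of the design is that only the integer process ID is taken as evidence of termination, so stray raw data from the child cannot be mistaken for an exit signal.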
## How was this patch tested?
Manual tests and Jenkins tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18465 from HyukjinKwon/SPARK-21093-retry-1.
## What changes were proposed in this pull request?
This adds documentation to many functions in pyspark.sql.functions.py:
`upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
Add units to the trigonometry functions.
Renames columns in datetime examples to be more informative.
Adds links between some functions.
## How was this patch tested?
`./dev/lint-python`
`python python/pyspark/sql/functions.py`
`./python/run-tests.py --module pyspark-sql`
Author: Michael Patterson <map222@gmail.com>
Closes#17865 from map222/spark-20456.
## What changes were proposed in this pull request?
For the randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. This patch adds a new method to set the underlying StringIndexer handleInvalid logic.
This patch should also apply to other classifiers. This PR focuses on the main logic and the randomForest classifier. I will do a follow-up PR for the other classifiers.
## How was this patch tested?
Add a new unit test based on the error case in the JIRA.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18496 from wangmiao1981/handle.
## What changes were proposed in this pull request?
Add doc for methods that were left out, and fix various style and consistency issues.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18493 from actuaryzhang/sparkRDocCleanup.
## What changes were proposed in this pull request?
Grouped documentation for column window methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18481 from actuaryzhang/sparkRDocWindow.
## What changes were proposed in this pull request?
Grouped documentation for column collection methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18458 from actuaryzhang/sparkRDocCollection.
## What changes were proposed in this pull request?
Grouped documentation for column misc methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18448 from actuaryzhang/sparkRDocMisc.
## What changes were proposed in this pull request?
Grouped documentation for nonaggregate column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18422 from actuaryzhang/sparkRDocNonAgg.
## What changes were proposed in this pull request?
This PR proposes to support a DDL-formatted string as schema as below:
```r
mockLines <- c("{\"name\":\"Michael\"}",
"{\"name\":\"Andy\", \"age\":30}",
"{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```
## How was this patch tested?
Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18431 from HyukjinKwon/r-ddl-schema.
## What changes were proposed in this pull request?
Grouped documentation for string column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18366 from actuaryzhang/sparkRDocString.
## What changes were proposed in this pull request?
Grouped documentation for math column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Closes#18371 from actuaryzhang/sparkRDocMath.
## What changes were proposed in this pull request?
`mcfork` in R appears to open a pipe ahead of time, but the existing logic does not properly close it when it is executed hot. This causes further forking to fail once the limit on the number of open files is reached.
This hot execution appears to occur particularly with `gapply`/`gapplyCollect`. For an unknown reason, this happens more easily on CentOS, and it could be reproduced on Mac too.
All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
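As a rough illustration of why an unclosed pipe matters, the plain-Python sketch below opens a few pipes and shows that their descriptors stay open until they are explicitly closed. It is not SparkR's daemon code, and the helper names are hypothetical. Each leaked pipe holds two descriptors, which is how repeated forking can eventually hit the open-files limit.

```python
import os


def open_pipes(n):
    """Open n pipes and return their (read_fd, write_fd) pairs."""
    return [os.pipe() for _ in range(n)]


def close_pipes(pipes):
    """Close both ends of every pipe so the descriptors are released."""
    for read_fd, write_fd in pipes:
        os.close(read_fd)
        os.close(write_fd)


def is_open(fd):
    """Return True if fd still refers to an open descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


pipes = open_pipes(3)
fds = [fd for pair in pipes for fd in pair]
open_before = [is_open(fd) for fd in fds]   # every descriptor is still held
close_pipes(pipes)
open_after = [is_open(fd) for fd in fds]    # released after the explicit close
print(sum(open_before), sum(open_after))
```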
## How was this patch tested?
I ran the code below on both CentOS and Mac with that configuration disabled/enabled.
```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
... # 30 times
```
Also, now it passes R tests on CentOS as below:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18320 from HyukjinKwon/SPARK-21093.
## What changes were proposed in this pull request?
Extend `setJobDescription` to SparkR API.
## How was this patch tested?
It looks difficult to add a test. Manually tested as below:
```r
df <- createDataFrame(iris)
count(df)
setJobDescription("This is an example job.")
count(df)
```
prints ...
![2017-06-22 12 05 49](https://user-images.githubusercontent.com/6477701/27415670-2a649936-5743-11e7-8e95-312f1cd103af.png)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18382 from HyukjinKwon/SPARK-21149.
## What changes were proposed in this pull request?
Grouped documentation for datetime column methods.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18114 from actuaryzhang/sparkRDocDate.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic Regression for ML. We should add it to SparkR.
## How was this patch tested?
Add new unit tests.
Author: wangmiao1981 <wm624@hotmail.com>
Closes#18128 from wangmiao1981/test.
## What changes were proposed in this pull request?
Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18140 from actuaryzhang/sparkRFormula.
## What changes were proposed in this pull request?
LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs.
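A minimal plain-Python sketch of the distinction, assuming the predicted label is 1 when the raw margin exceeds the threshold (that rule and the `predict` helper are assumptions of this sketch, not a quote of the implementation). Because the raw margin is an unbounded real number rather than a probability in [0, 1], any Double value is a meaningful threshold.

```python
import math


def predict(raw_prediction, threshold=0.0):
    """Label 1.0 when the raw margin exceeds the threshold, else 0.0."""
    return 1.0 if raw_prediction > threshold else 0.0


margins = [-2.5, -0.1, 0.3, 4.0]
default_labels = [predict(m) for m in margins]
shifted_labels = [predict(m, threshold=1.0) for m in margins]
# Extreme thresholds force a single class for every input.
all_zero = [predict(m, threshold=math.inf) for m in margins]
all_one = [predict(m, threshold=-math.inf) for m in margins]
print(default_labels, shifted_labels)
```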
## How was this patch tested?
New unit test to make sure the threshold can be set to any Double value.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#18151 from jkbradley/ml-2.2-linearsvc-cleanup.
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18025 from actuaryzhang/sparkRDoc4.
## What changes were proposed in this pull request?
Add SQL trunc function
## How was this patch tested?
standard test
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#18291 from actuaryzhang/sparkRTrunc2.
## What changes were proposed in this pull request?
This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of the R tests passes. Otherwise the second run can fail, which is a bit annoying.
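The ordering can be sketched in plain Python as follows. This is an illustration under assumed names (`list_after_cleanup`, a temporary directory standing in for `SPARK_HOME`), not the R test itself: because the leftovers are removed before the listing is taken, a listing from a second run matches the first.

```python
import os
import shutil
import tempfile

LEFTOVERS = ("spark-warehouse", "metastore_db")


def list_after_cleanup(root):
    """Remove known leftover directories, then list what remains."""
    for name in LEFTOVERS:
        shutil.rmtree(os.path.join(root, name), ignore_errors=True)
    return sorted(os.listdir(root))


def leave_leftovers(root):
    """Simulate a test run that leaves the two directories behind."""
    for name in LEFTOVERS:
        os.mkdir(os.path.join(root, name))


root = tempfile.mkdtemp()
for name in ("README.md", "run-tests.sh"):
    open(os.path.join(root, name), "w").close()

leave_leftovers(root)                 # state left by an earlier run
first = list_after_cleanup(root)      # listing taken after the cleanup
leave_leftovers(root)                 # the next run leaves them again
second = list_after_cleanup(root)
shutil.rmtree(root)
print(first == second)
```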
## How was this patch tested?
Manually running R tests multiple times via `./R/run-tests.sh`.
**Before**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................
Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2
4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"
x[17]: "pkg"
y[17]: "R"
x[18]: "R"
y[18]: "README.md"
x[19]: "README.md"
y[19]: "run-tests.sh"
x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"
x[21]: "metastore_db"
y[21]: "pkg"
x[22]: "pkg"
y[22]: "R"
x[23]: "R"
y[23]: "README.md"
x[24]: "README.md"
y[24]: "run-tests.sh"
x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"
DONE ===========================================================================
```
**After**
Second run:
```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18335 from HyukjinKwon/SPARK-21128.
## What changes were proposed in this pull request?
Update the dependency packages for Running R Tests to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18271 from wangyum/building-spark.
### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users. It does not mean one record per file; one file can contain multiple records. Thus, we should rename it. The proposal is `multiLine`.
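To illustrate the distinction with Python's standard `csv` module (not Spark's CSV reader; the sample text is made up for this sketch): a single input can hold several records, and one of them can span more than one physical line because of a quoted newline.

```python
import csv
import io

# One "file" with a header and two records; the first record contains a
# quoted field with an embedded newline, so it spans two physical lines.
text = 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'

physical_lines = text.splitlines()
records = list(csv.reader(io.StringIO(text, newline="")))

print(len(physical_lines), len(records))
```

Here the number of physical lines and the number of parsed records differ, yet the file still holds more than one record, which is why `multiLine` describes the behaviour better than `wholeFile`.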
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18202 from gatorsmile/renameCVSOption.
## What changes were proposed in this pull request?
clean up after big test move
## How was this patch tested?
unit tests, jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18267 from felixcheung/rtestset2.
## What changes were proposed in this pull request?
Move all existing tests to a non-installed directory so that they will never be run as a result of installing the SparkR package
For a follow-up PR:
- remove all skip_on_cran() calls in tests
- clean up test timer
- improve or change basic tests that do run on CRAN (if anyone has suggestion)
It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them)
## How was this patch tested?
- [x] unit tests, Jenkins
- [x] AppVeyor
- [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18264 from felixcheung/rtestset.
## What changes were proposed in this pull request?
Document that Dataset.union resolves columns by position, not by name, since this has been a confusing point for a lot of users.
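A plain-Python sketch of the difference, with rows modelled as tuples. The two helper functions are hypothetical and only mimic the semantics; they are not Spark APIs.

```python
def union_by_position(left_cols, left_rows, right_cols, right_rows):
    """Append right rows as-is; the right-hand column names are ignored."""
    if len(left_cols) != len(right_cols):
        raise ValueError("both sides need the same number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Reorder each right row so its values line up with the left names."""
    order = [right_cols.index(col) for col in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


left_cols, left_rows = ["name", "city"], [("Ann", "Oslo")]
right_cols, right_rows = ["city", "name"], [("Rome", "Bob")]

_, positional = union_by_position(left_cols, left_rows, right_cols, right_rows)
_, by_name = union_by_name(left_cols, left_rows, right_cols, right_rows)
print(positional)
print(by_name)
```

With positional resolution the city value ends up in the `name` column, which is the surprise the documentation change warns about.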
## How was this patch tested?
N/A - doc only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#18256 from rxin/SPARK-21042.
## What changes were proposed in this pull request?
to investigate how long they run
## How was this patch tested?
Jenkins, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18104 from felixcheung/rtimetest.
## What changes were proposed in this pull request?
1, add an example for sparkr `decisionTree`
2, document it in user guide
## How was this patch tested?
local submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#18067 from zhengruifeng/dt_example.
## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#18035 from yanboliang/svm-r.
## What changes were proposed in this pull request?
This change skips tests that use the Hadoop libraries while running
on CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are missing on the target
system. The skipped tests consist of
1. Tests that save, load a model in MLlib
2. Tests that save, load CSV, JSON and Parquet files in SQL
3. Hive tests
## How was this patch tested?
Tested by running on a local windows VM with HADOOP_HOME unset. Also testing with https://win-builder.r-project.org
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#17966 from shivaram/sparkr-windows-cran.
## What changes were proposed in this pull request?
support decision tree in R
## How was this patch tested?
added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17981 from zhengruifeng/dt_r.
## What changes were proposed in this pull request?
Some examples in the DataFrame methods are syntactically wrong, even though they are pseudo code. Fix these and some style issues.
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#18003 from actuaryzhang/sparkRDoc3.
## What changes were proposed in this pull request?
Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples.
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17988 from zero323/cube-docs.
## What changes were proposed in this pull request?
- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.
## How was this patch tested?
Unit tests, check `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17965 from zero323/SPARK-20726.
## What changes were proposed in this pull request?
This PR proposes three things as below:
- Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
- Support single argument for `to_timestamp` similarly with APIs in other languages.
For example, the one below works
```
import org.apache.spark.sql.functions._
Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
```
prints
```
+----------------------------------------+
|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
+----------------------------------------+
| 2016-12-31 00:12:00|
+----------------------------------------+
```
whereas this does not work in SQL.
**Before**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
```
**After**
```
spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
```
- Related document improvement for SQL function descriptions and other API descriptions accordingly.
**Before**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00.0
```
**After**
```
spark-sql> DESCRIBE FUNCTION extended to_date;
...
Usage:
to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
a date. Returns null with invalid input. By default, it follows casting rules to a date if
the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
```
```
spark-sql> DESCRIBE FUNCTION extended to_timestamp;
...
Usage:
to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
a timestamp. Returns null with invalid input. By default, it follows casting rules to
a timestamp if the `fmt` is omitted.
Extended Usage:
Examples:
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
> SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00
```
## How was this patch tested?
Added tests in `datetime.sql`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17901 from HyukjinKwon/to_timestamp_arg.
## What changes were proposed in this pull request?
- [x] need to test by running R CMD check --as-cran
- [x] sanity check vignettes
## How was this patch tested?
Jenkins
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17945 from felixcheung/rchangesforpackage.
## What changes were proposed in this pull request?
Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs
## How was this patch tested?
unit tests, this needs to combine with another commit with SQL change to check
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17905 from felixcheung/rtabletests.
## What changes were proposed in this pull request?
Cleaning existing temp tables before running tableNames tests
## How was this patch tested?
SparkR Unit tests
Author: Hossein <hossein@databricks.com>
Closes#17903 from falaki/SPARK-20661.
## What changes were proposed in this pull request?
Fix typo in vignettes
Author: Wayne Zhang <actuaryzhang@uber.com>
Closes#17884 from actuaryzhang/typo.
## What changes were proposed in this pull request?
set timezone on windows
## How was this patch tested?
unit test, AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17892 from felixcheung/rtimestamptest.
## What changes were proposed in this pull request?
- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).
## How was this patch tested?
Unit tests, `check_cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17825 from zero323/SPARK-20550.
## What changes were proposed in this pull request?
add environment
## How was this patch tested?
wait for appveyor run
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17878 from felixcheung/appveyorrcran.
## What changes were proposed in this pull request?
Make tests more reliable by having them wait until the data is processed.
Increasing the timeout value might help, but ultimately the flakiness from processing delays on Jenkins is hard to account for. This isn't an actual supported public API.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17857 from felixcheung/rsstestrelia.
## What changes were proposed in this pull request?
Adds wrapper for `o.a.s.sql.functions.input_file_name`
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17818 from zero323/SPARK-20544.
## What changes were proposed in this pull request?
Adds support for generic hints on `SparkDataFrame`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17851 from zero323/SPARK-20585.
## What changes were proposed in this pull request?
Add
- R vignettes
- R programming guide
- SS programming guide
- R example
Also disable spark.als in vignettes for now since it's failing (SPARK-20402)
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17814 from felixcheung/rdocss.
## What changes were proposed in this pull request?
General rule on skip or not:
skip if
- RDD tests
- tests could run long or complicated (streaming, hivecontext)
- tests on error conditions
- tests won't likely change/break
## How was this patch tested?
unit tests, `R CMD check --as-cran`, `R CMD check`
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17817 from felixcheung/rskiptest.
## What changes were proposed in this pull request?
doc only
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17828 from felixcheung/rnotfamily.
## What changes were proposed in this pull request?
Adds R wrappers for:
- `o.a.s.sql.functions.grouping` as `o.a.s.sql.functions.is_grouping` (to avoid shading `base::grouping`)
- `o.a.s.sql.functions.grouping_id`
## How was this patch tested?
Existing unit tests, additional unit tests. `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17807 from zero323/SPARK-20532.
## What changes were proposed in this pull request?
Add without param for timeout - will need this to submit a job that runs until stopped
Need this for 2.2
## How was this patch tested?
manually, unit test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17815 from felixcheung/rssawaitinfinite.
## What changes were proposed in this pull request?
- Add null-safe equality operator `%<=>%` (same as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
- Add boolean negation operator `!` and function `not`.
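The semantics of null-safe equality can be sketched in plain Python, with `None` standing in for SQL `NULL`. The two functions are illustrative helpers, not SparkR or Spark APIs.

```python
def sql_equal(a, b):
    """Ordinary SQL equality: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a, b):
    """Null-safe equality: NULL equals NULL, and the result is never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


pairs = [(1, 1), (1, 2), (1, None), (None, None)]
plain = [sql_equal(a, b) for a, b in pairs]
safe = [null_safe_equal(a, b) for a, b in pairs]
print(plain)
print(safe)
```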
## How was this patch tested?
Existing unit tests, additional unit tests, `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17783 from zero323/SPARK-20490.
## What changes were proposed in this pull request?
Add R wrappers for
- `o.a.s.sql.functions.explode_outer`
- `o.a.s.sql.functions.posexplode_outer`
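For readers unfamiliar with the `_outer` variants, the plain-Python sketch below mimics the difference for arrays: the outer form keeps a row with a null element when the array is null or empty. The helpers are illustrative only and are not the wrapped Spark functions.

```python
def explode(rows):
    """One (key, item) row per array element; null or empty arrays vanish."""
    return [(key, item) for key, items in rows for item in (items or [])]


def explode_outer(rows):
    """Like explode, but a null or empty array still yields (key, None)."""
    out = []
    for key, items in rows:
        if items:
            out.extend((key, item) for item in items)
        else:
            out.append((key, None))
    return out


rows = [("a", [1, 2]), ("b", []), ("c", None)]
inner = explode(rows)
outer = explode_outer(rows)
print(inner)
print(outer)
```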
## How was this patch tested?
Additional unit tests, manual testing.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17809 from zero323/SPARK-20535.
## What changes were proposed in this pull request?
It seems we are using `SQLUtils.getSQLDataType` for the type string in structField. It looks like we can replace this with `CatalystSqlParser.parseDataType`.
Both accept similar DDL-like type definitions, as below:
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
Such type strings look identical to R's, as below:
```R
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
struct
1 a
```
R's is stricter because we check the types via regular expressions on the R side beforehand.
The actual logic there looks a bit different, but as we check the type ahead on the R side, I think replacing it would introduce no behaviour changes. To make sure of this, dedicated tests were added in SPARK-20105. (It looks like `structField` is the only place that calls this method.)
## How was this patch tested?
Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17785 from HyukjinKwon/SPARK-20493.
## What changes were proposed in this pull request?
Replace
note repeat_string 2.3.0
with
note repeat_string since 2.3.0
## How was this patch tested?
`create-docs.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17779 from zero323/REPEAT-NOTE.
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they do not converge. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue in #17746 when we upgraded breeze to 0.13.1.
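A toy illustration of why intermediate results are fragile, using plain gradient descent on a one-dimensional quadratic. This is not the optimizer used by the tests; the step size merely stands in for any implementation detail that can change between library versions.

```python
def gradient_descent(step_size, max_iter, start=0.0):
    """Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)."""
    x = start
    for _ in range(max_iter):
        x -= step_size * 2.0 * (x - 3.0)
    return x


# After only two iterations the answer depends heavily on the step size...
early_a = gradient_descent(0.1, 2)
early_b = gradient_descent(0.4, 2)

# ...while the converged answer is the same minimiser in both cases.
final_a = gradient_descent(0.1, 200)
final_b = gradient_descent(0.4, 200)
print(early_a, early_b, final_a, final_b)
```

A test asserting on the two-iteration values would break whenever the optimizer's internals change, whereas one asserting on the converged value would not.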
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17757 from yanboliang/flaky-test.
## What changes were proposed in this pull request?
- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.
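As background, the grouping sets that `rollup` and `cube` aggregate over can be enumerated with a short plain-Python sketch. The helper names and the column names are made up for illustration; this is not SparkR code.

```python
from itertools import combinations


def rollup_sets(cols):
    """Hierarchical prefixes: (a, b), (a,), and the grand total ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Every subset of the grouping columns, largest first."""
    return [combo
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]


cols = ["cyl", "gear"]
print(rollup_sets(cols))
print(cube_sets(cols))
```

For n grouping columns, rollup yields n + 1 grouping sets and cube yields 2^n.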
## How was this patch tested?
- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17728 from zero323/SPARK-20437.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
Add wrappers for `o.a.s.sql.functions`:
- `split` as `split_string`
- `repeat` as `repeat_string`
## How was this patch tested?
Existing tests, additional unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17729 from zero323/SPARK-20438.
## What changes were proposed in this pull request?
Adds wrappers for `collect_list` and `collect_set`.
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17672 from zero323/SPARK-20371.
## What changes were proposed in this pull request?
Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
## How was this patch tested?
Unit tests, `check-cran.sh`
Author: zero323 <zero323@users.noreply.github.com>
Closes#17674 from zero323/SPARK-20375.
## What changes were proposed in this pull request?
Checking a source parameter is asynchronous. When the query is created, it's not guaranteed that source has been created. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown.
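The timing issue can be pictured with a small plain-Python sketch. The `Query` class, its `await_termination` method, and the delays are hypothetical stand-ins, not the Spark streaming API: validation happens on a background thread, so a wait that is too short returns before the error exists.

```python
import threading
import time


class Query:
    """Toy stand-in for a query whose source is validated on a background thread."""

    def __init__(self, validation_delay):
        self.error = None
        self._done = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(validation_delay,), daemon=True
        )
        self._thread.start()

    def _run(self, delay):
        time.sleep(delay)  # the source is created some time after the query object
        self.error = ValueError("invalid source option")
        self._done.set()

    def await_termination(self, timeout):
        """Wait up to `timeout` seconds; re-raise the background error if it arrived."""
        finished = self._done.wait(timeout)
        if finished and self.error is not None:
            raise self.error
        return finished


def error_seen(timeout):
    query = Query(validation_delay=0.5)
    try:
        query.await_termination(timeout)
    except ValueError:
        return True
    return False


print(error_seen(0.01))  # too short: returns before the error is raised
print(error_seen(5.0))   # long enough: the error surfaces
```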
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17687 from zsxwing/SPARK-20397.
## What changes were proposed in this pull request?
Document fpGrowth in:
- vignettes
- programming guide
- code example
## How was this patch tested?
Manual tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17557 from zero323/SPARK-20208.
## What changes were proposed in this pull request?
This was suggested to be `as.json.array` in the first place in the PR for SPARK-19828, but we could not do this because the lint check emits an error for multiple dots in variable names.
After SPARK-20278, we are now able to use `multiple.dots.in.names`. `asJsonArray` in the `from_json` function can still be changed, as 2.2 is not released yet.
So, this PR proposes to rename `asJsonArray` to `as.json.array`.
## How was this patch tested?
Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17653 from HyukjinKwon/SPARK-19828-followup.
## What changes were proposed in this pull request?
Currently, multi-dot separated variable names in R are not allowed. For example,
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
produces an error as below:
```
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
This seems to go against https://google.github.io/styleguide/Rguide.xml#identifiers which says
> The preferred form for variable names is all lower case letters and words separated with dots
This appears to be because lintr (https://github.com/jimhester/lintr) by default follows http://r-pkgs.had.co.nz/style.html, as written in its README.md. A few of its rules do not seem to follow Google's guide, in line with the "a few tweaks" it mentions.
Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide with a few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43
Also, it looks like we have no restriction on function names, yet this rule also appears to affect function names, as written in the README.md.
> `multiple_dots_linter`: check that function and variable names are separated by _ rather than ..
## How was this patch tested?
Manually tested `./dev/lint-r` with the manual change below in `R/functions.R`:
```diff
setMethod("from_json", signature(x = "Column", schema = "structType"),
- function(x, schema, asJsonArray = FALSE, ...) {
+ function(x, schema, as.json.array = FALSE, ...) {
if (asJsonArray) {
jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
"createArrayType",
```
**Before**
```R
R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
function(x, schema, as.json.array = FALSE, ...) {
^~~~~~~~~~~~~
```
**After**
```
lintr checks passed.
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17590 from HyukjinKwon/disable-dot-in-name.
## What changes were proposed in this pull request?
Fixed spelling of "charactor"
## How was this patch tested?
Spelling change only
Author: Brendan Dwyer <brendan.dwyer@ibm.com>
Closes#17611 from bdwyer2/SPARK-20298.
## What changes were proposed in this pull request?
Test failed because SPARK_HOME is not set before Spark is installed.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17516 from felixcheung/rdircheckincran.
## What changes were proposed in this pull request?
Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17511 from felixcheung/rceatetable.
## What changes were proposed in this pull request?
Update doc to remove external for createTable, add refreshByPath in python
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17512 from felixcheung/catalogdoc.
## What changes were proposed in this pull request?
minor update
zero323
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17526 from felixcheung/rfpgrowthfollowup.
## What changes were proposed in this pull request?
It seems the CRAN check script corrects `R/pkg/DESCRIPTION` so that it follows the expected order in the `Collate` field.
This PR proposes to fix the position of `catalog.R` so that running this script does not produce a small diff in this file every time.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17528 from HyukjinKwon/minor-reorder-description.
## What changes were proposed in this pull request?
Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.
## How was this patch tested?
Feature specific unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17170 from zero323/SPARK-19825.
## What changes were proposed in this pull request?
Add a set of catalog APIs in R
```
"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",
```
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17483 from felixcheung/rcatalog.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
## What changes were proposed in this pull request?
If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" will fail.
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#17452 from zuotingbing/spark-bulid.
## What changes were proposed in this pull request?
It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)) but resembles SQL type definition.
Therefore, it seems nicer if we test positive/negative cases in R side.
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17439 from HyukjinKwon/r-typestring-tests.
## What changes were proposed in this pull request?
This PR proposes to match the minor documentation changes in https://github.com/apache/spark/pull/17399 and https://github.com/apache/spark/pull/17380 in R/Python.
## How was this patch tested?
Manual tests in Python, Python tests via `./python/run-tests.py --module=pyspark-sql` and lint-checks for Python/R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17429 from HyukjinKwon/minor-match-doc.
## What changes were proposed in this pull request?
SparkR ```spark.getSparkFiles``` fails when it is called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
## How was this patch tested?
Add unit tests, and verify this fix at standalone and yarn cluster.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17274 from yanboliang/spark-19925.
## What changes were proposed in this pull request?
When SparkR is installed as an R package, there might not be any Java runtime.
If there is none, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell without any notification or message.
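The general idea of failing fast can be sketched in plain Python as follows. This is an illustration only: the function name `find_java` and the exact lookup order (JAVA_HOME first, then PATH) are assumptions, and the actual check in SparkR may differ.

```python
import os
import shutil


def find_java(env=None, which=shutil.which):
    """Return a path to a java executable, or raise immediately if none is found."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        return os.path.join(java_home, "bin", "java")
    on_path = which("java")
    if on_path:
        return on_path
    raise RuntimeError("Java not found: set JAVA_HOME or put java on PATH")


# With no JAVA_HOME and nothing on PATH, the caller gets an error right away
# instead of waiting for a connection timeout.
try:
    find_java(env={}, which=lambda name: None)
except RuntimeError as err:
    print(err)
```

Passing `env` and `which` as parameters keeps the check testable without touching the real environment.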
## How was this patch tested?
manually
- [x] need to test on Windows
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16596 from felixcheung/rcheckjava.
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.
## How was this patch tested?
existing tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#17369 from zhengruifeng/doc_quantiles_nan.
## What changes were proposed in this pull request?
Currently JSON and CSV have exactly the same logic for handling bad records. This PR tries to abstract it and put it in an upper level to reduce code duplication.
The overall idea is that the JSON and CSV parsers throw a BadRecordException, and the upper level, FailureSafeParser, handles bad records according to the parse mode.
Behavior changes:
1. In PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated it as a legal record and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, but we still parse as many tokens as possible.
2. All logging is removed, as it is not very useful in practice.
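The division of labour described above can be sketched in plain Python. This is a toy model, not the Scala implementation: `parse_csv_line` and `failure_safe_parse` are hypothetical names, the "special column" is modelled as an extra last field, and the mode names other than PERMISSIVE come from Spark's documented parse modes rather than from this description.

```python
class BadRecordException(Exception):
    """Raised by a low-level parser; carries the raw record and any partial result."""

    def __init__(self, record, partial):
        super().__init__(f"malformed record: {record!r}")
        self.record = record
        self.partial = partial


def parse_csv_line(line, num_fields):
    """Parse one comma-separated line; signal a bad record on a token-count mismatch."""
    tokens = line.split(",")
    if len(tokens) != num_fields:
        # Keep as many tokens as fit the schema, padding missing ones with None.
        partial = (tokens + [None] * num_fields)[:num_fields]
        raise BadRecordException(line, partial)
    return tokens


def failure_safe_parse(lines, num_fields, mode="PERMISSIVE"):
    """Apply the parse mode in one place instead of inside each format's parser."""
    rows = []
    for line in lines:
        try:
            rows.append(parse_csv_line(line, num_fields) + [None])
        except BadRecordException as e:
            if mode == "PERMISSIVE":
                # Partial fields plus the raw record in an extra, final column.
                rows.append(e.partial + [e.record])
            elif mode == "DROPMALFORMED":
                continue
            elif mode == "FAILFAST":
                raise
            else:
                raise ValueError(f"unknown mode: {mode}")
    return rows


data = ["1,a", "2", "3,c,extra"]
print(failure_safe_parse(data, num_fields=2))
print(failure_safe_parse(data, num_fields=2, mode="DROPMALFORMED"))
```

Because the parser only reports the problem and the wrapper decides what to do, a second format can reuse `failure_safe_parse` by raising the same exception.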
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>
Closes#17315 from cloud-fan/bad-record2.
## What changes were proposed in this pull request?
doc only change
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17356 from felixcheung/rdfcheckpoint2.
## What changes were proposed in this pull request?
Add checkpoint, setCheckpointDir API to R
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17351 from felixcheung/rdfcheckpoint.
## What changes were proposed in this pull request?
This PR proposes to support an array of struct type in `to_json` as below:
```scala
import org.apache.spark.sql.functions._
val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```
```
+----------+
| json|
+----------+
|[{"_1":1}]|
+----------+
```
Currently, it throws an exception as below (a newline manually inserted for readability):
```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
mismatch: structtojson requires that the expression is a struct expression.;;
```
This allows the roundtrip with `from_json` as below:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()
// Read back.
df.select(to_json($"array").as("json")).show()
```
```
+----------+
| array|
+----------+
|[[1], [2]]|
+----------+
+-----------------+
| json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```
Also, this PR proposes to rename from `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
## How was this patch tested?
Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17192 from HyukjinKwon/SPARK-19849.
## What changes were proposed in this pull request?
Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to JVM, set System.Property for derby home dir to move derby.log
## How was this patch tested?
Manually, unit tests
With this, these are relocated to under /tmp
```
# ls /tmp/RtmpG2M0cB/
derby.log
```
And they are removed automatically when the R session is ended.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16330 from felixcheung/rderby.
## What changes were proposed in this pull request?
It seems the CRAN check script corrects `R/pkg/DESCRIPTION` so that it follows the expected order in the `Collate` field.
This PR proposes to fix this so that running this script does not produce a diff in this file.
## How was this patch tested?
Manually via `./R/check-cran.sh`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17349 from HyukjinKwon/minor-cran.
## What changes were proposed in this pull request?
Add "experimental" API for SS in R
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16982 from felixcheung/rss.
## What changes were proposed in this pull request?
Since we could not directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField` as below:
```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
```
prints
```R
arrcol
1 Bob, Alice
```
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17178 from HyukjinKwon/SPARK-19828.
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR
felixcheung yanboliang
## How was this patch tested?
new test in SparkR
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16729 from actuaryzhang/sparkRTweedie.
## What changes were proposed in this pull request?
RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.
The following 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`
## How was this patch tested?
Test manually on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#17207 from keypointt/SPARK-19282.
### What changes were proposed in this pull request?
Observed by felixcheung in https://github.com/apache/spark/pull/16739: when users call the shuffle-enabled `repartition` API, they expect the resulting number of partitions to be exactly the number they provided, even if they call the shuffle-disabled `coalesce` later.
Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we get the following unexpected result.
```Scala
val df = spark.range(0, 10000, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```
This PR is to fix the issue. We preserve shuffle-enabled Repartition.
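A toy model of the partition counts, written in plain Python, may help to see the difference. It is a sketch under simple assumptions: `repartition(n)` always yields `n` partitions, `coalesce(n)` can only reduce the count, and the old rule keeps only the last of two adjacent operators. The function name `num_partitions` is hypothetical, and the "preserved" numbers are what this model predicts, not output quoted from the PR.

```python
def num_partitions(initial, ops, collapse_adjacent):
    """Return the partition count after applying ops, given as (n, shuffle) pairs.

    shuffle=True models repartition(n); shuffle=False models coalesce(n),
    which can only reduce the number of partitions.
    """
    if collapse_adjacent:
        # Old behaviour as described above: adjacent operators collapse into
        # the last one, regardless of whether the earlier one shuffles.
        ops = ops[-1:]
    count = initial
    for n, shuffle in ops:
        count = n if shuffle else min(count, n)
    return count


for target in (13, 7, 3):
    plan = [(10, True), (target, False)]  # repartition(10) then coalesce(target)
    collapsed = num_partitions(5, plan, collapse_adjacent=True)
    preserved = num_partitions(5, plan, collapse_adjacent=False)
    print(target, collapsed, preserved)
```

With collapsing, the model reproduces the 5, 5, 3 from the assertions above; with the shuffle-enabled step preserved, it gives 10, 7, 3.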
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#16933 from gatorsmile/CollapseRepartition.
## What changes were proposed in this pull request?
Added checks for name consistency of input data frames in union.
## How was this patch tested?
new test.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17159 from actuaryzhang/sparkRUnion.
## What changes were proposed in this pull request?
Add column functions: to_json, from_json, and tests covering error cases.
## How was this patch tested?
unit tests, manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17134 from felixcheung/rtojson.
## What changes were proposed in this pull request?
Update R document to use JDK8.
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#17162 from wangyum/SPARK-19550.
## What changes were proposed in this pull request?
Update doc for R, programming guide. Clarify default behavior for all languages.
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17128 from felixcheung/jsonwholefiledoc.
Update R doc:
1. columns, names and colnames return a vector of strings, not a **list** as in the current doc.
2. `colnames<-` does allow the subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`.
felixcheung
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17115 from actuaryzhang/sparkRMinorDoc.
## What changes were proposed in this pull request?
Replace `iris` dataset with `Titanic` or other dataset in example and document.
## How was this patch tested?
Manual and existing test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#17032 from wangmiao1981/example.
## What changes were proposed in this pull request?
The `[[` method is supposed to take a single index and return a column. This is different from base R which takes a vector index. We should check for this and issue warning or error when vector index is supplied (which is very likely given the behavior in base R).
Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.
## How was this patch tested?
new tests
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#17017 from actuaryzhang/sparkRSubsetter.
## What changes were proposed in this pull request?
This is a follow-up PR of #16800
When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.
## How was this patch tested?
Existing tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16945 from wangmiao1981/svc.
## What changes were proposed in this pull request?
We recently added the spark.svmLinear API for SparkR. We need to add an example and update the vignettes.
## How was this patch tested?
Manually run example.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16969 from wangmiao1981/example.
## What changes were proposed in this pull request?
SparkR ```approxQuantile``` supports multiple input columns.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16951 from yanboliang/spark-19619.
## What changes were proposed in this pull request?
Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16739 from felixcheung/rcoalesce.
## What changes were proposed in this pull request?
The linear SVM classifier was newly added to ML, and a Python API has been added. This JIRA is to add the R-side API.
Marked as WIP, as I am designing unit tests.
## How was this patch tested?
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16800 from wangmiao1981/svc.
## What changes were proposed in this pull request?
- this is caused by changes in SPARK-18444 and SPARK-18643, after which we no longer install Spark when `master = ""` (default), but it is also related to SPARK-18449, since the real `master` value is not known at the time the R code in `sparkR.session` is run. (`master` cannot default to "local" since it could be overridden by the spark-submit command line or spark config)
- as a result, while running SparkR as a package in IDE is working fine, CRAN check is not as it is launching it via non-interactive script
- fix is to add check to the beginning of each test and vignettes; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better.
## How was this patch tested?
Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar.
manually as:
```
# modify DESCRIPTION to revert version to 2.1.0
SPARK_HOME=/usr/spark R CMD build pkg
# run cran check without SPARK_HOME
R CMD check --as-cran SparkR_2.1.0.tar.gz
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16720 from felixcheung/rcranchecktest.
## What changes were proposed in this pull request?
Fix a bug in the collect method when collecting a timestamp column. The bug can be reproduced with the following code:
```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))
sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```
As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when NA is at the top of the column.
This is caused by `do.call(c, list)`: if we convert a list, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
Therefore, we need to cast the data type of the vector explicitly.
## How was this patch tested?
The patch can be tested manually with the same code above.
Author: titicaca <fangzhou.yang@hotmail.com>
Closes#16689 from titicaca/sparkr-dev.
## What changes were proposed in this pull request?
After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.
**BEFORE**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7']
Attempting to post to Github...
> Post successful.
```
**AFTER**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
```
## How was this patch tested?
Pass the existing test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16858 from dongjoon-hyun/hotfix_run-tests.
## What changes were proposed in this pull request?
This pull request adds two new user facing functions:
- `to_date` which accepts an expression and a format and returns a date.
- `to_timestamp` which accepts an expression and a format and returns a timestamp.
For example, given a date in the format `2016-21-05` (yyyy-dd-MM):
### Date Function
*Previously*
```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```
*Current*
```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```
### Timestamp Function
*Previously*
```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```
*Current*
```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```
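For readers who want to check what the `yyyy-dd-MM` pattern means for the sample value, here is a small plain-Python sketch. The two functions only borrow the names `to_date` and `to_timestamp` for readability; they are ordinary Python helpers, not the Spark API, and the translation of the pattern to `%Y-%d-%m` is done by hand.

```python
from datetime import datetime


def to_timestamp(text, fmt="%Y-%d-%m"):
    """Parse text with an explicit format; the default mirrors yyyy-dd-MM."""
    return datetime.strptime(text, fmt)


def to_date(text, fmt="%Y-%d-%m"):
    """Same parse, truncated to a calendar date."""
    return to_timestamp(text, fmt).date()


print(to_date("2016-21-05"))       # 2016-05-21
print(to_timestamp("2016-21-05"))  # 2016-05-21 00:00:00
```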
### Tasks
- [X] Add `to_date` to Scala Functions
- [x] Add `to_date` to Python Functions
- [x] Add `to_date` to SQL Functions
- [X] Add `to_timestamp` to Scala Functions
- [x] Add `to_timestamp` to Python Functions
- [x] Add `to_timestamp` to SQL Functions
- [x] Add function to R
## How was this patch tested?
- [x] Add Functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` Expression (*not necessary*)
- Test new `ParseToDate` Expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>
Closes#16138 from anabranch/SPARK-16609.
## What changes were proposed in this pull request?
The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names.
## How was this patch tested?
new tests.
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16794 from actuaryzhang/sparkRNames.
## What changes were proposed in this pull request?
The current version has an error in the vignettes:
```
model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
summary(kmeansModel)
```
`kmeansModel` does not exist...
felixcheung wangmiao1981
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16799 from actuaryzhang/sparkRVignettes.
## What changes were proposed in this pull request?
Update programming guide, example and vignette with Bisecting k-means.
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes#16767 from krishnakalyan3/bisecting-kmeans.
## What changes were proposed in this pull request?
When Kmeans uses initMode = "random" with some random seeds, it is possible that the actual number of clusters does not equal the configured `k`.
In this case, summary(model) returns an error because the number of columns of the coefficient matrix does not equal k.
Example:
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
data length [9] is not a sub-multiple or multiple of the number of rows [2]
Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
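The shape mismatch and the fix can be sketched in plain Python. This is an illustration, not the SparkR code: the function name `coefficient_matrix`, the row-major layout, and the center values are made up, and only the counts (9 values, 3 features, configured k of 5) follow the example above.

```python
def coefficient_matrix(coefficients, features, cluster_count):
    """Reshape a flat, row-major list of centers into one row per cluster."""
    width = len(features)
    if len(coefficients) != cluster_count * width:
        raise ValueError(
            f"{len(coefficients)} values cannot fill {cluster_count} x {width}"
        )
    return [coefficients[i * width:(i + 1) * width] for i in range(cluster_count)]


features = ["col1", "col2", "col3"]
flat = [0.0, 0.0, 0.0, 1.5, 1.5, 1.5, 3.5, 3.5, 3.5]  # 9 values: 3 centers found
configured_k = 5

try:
    coefficient_matrix(flat, features, configured_k)  # 9 values cannot fill 5 x 3
except ValueError as err:
    print(err)

actual_k = len(flat) // len(features)  # stand-in for asking the model
print(coefficient_matrix(flat, features, actual_k))
```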
## How was this patch tested?
Add unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16666 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
The `coefficients` component in the model summary should be a 'matrix', but the underlying structure is actually a list. This affects several models; the exception is 'AFTSurvivalRegressionModel', which has the correct implementation. The fix is to first `unlist` the coefficients returned from `callJMethod` before converting to a matrix. An example illustrates the issue:
```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)
> str(s$coefficients)
List of 8
$ : num 6.53
$ : num -0.223
$ : num 0.479
$ : num 0.155
$ : num 13.6
$ : num -1.44
$ : num 0
$ : num 0.152
- attr(*, "dim")= int [1:2] 2 4
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "Sepal_Width"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963
$Sepal_Width
[1] 0.1550809
```
This shows that the underlying structure of coefficients is still `list`.
felixcheung wangmiao1981
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes#16730 from actuaryzhang/sparkRCoef.
## What changes were proposed in this pull request?
With extract `[[` or replace `[[<-`, the parameter `i` is a column index, that needs to be corrected in doc. Also a few minor updates: examples, links.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16721 from felixcheung/rsubsetdoc.
## What changes were proposed in this pull request?
This mostly affects running a job from the driver in client mode when results are expected through stdout (which should be somewhat rare, but possible)
Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```
After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```
Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`) but there are other more complex examples with calls of `println` in Scala/JVM side, that are getting dropped.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16670 from felixcheung/rjvmstdout.
## What changes were proposed in this pull request?
add header
## How was this patch tested?
Manual run to check vignettes html is created properly
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16709 from felixcheung/rfilelicense.
## What changes were proposed in this pull request?
With doc to say this would convert DF into RDD
## How was this patch tested?
unit tests, manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16668 from felixcheung/rgetnumpartitions.
## What changes were proposed in this pull request?
Add R wrapper for bisecting Kmeans.
As JIRA is down, I will update title to link with corresponding JIRA later.
## How was this patch tested?
Add new unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16566 from wangmiao1981/bk.
## What changes were proposed in this pull request?
Support for
```
df[[myname]] <- 1
df[[2]] <- df$eruptions
```
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16663 from felixcheung/rcolset.
## What changes were proposed in this pull request?
```spark.gaussianMixture``` now outputs the total log-likelihood for the model, like R's ```mvnormalmixEM```.
## How was this patch tested?
R unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16646 from yanboliang/spark-19291.
## What changes were proposed in this pull request?
When SparkR is started as a package and needs to download the Spark release distribution, we need to handle errors from the download and untar steps and clean up; otherwise it will get stuck.
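The pattern of checking the unpack step and cleaning up on failure can be sketched in plain Python. This is only an illustration of the idea: the helper name `unpack_or_clean_up` and the file name `spark.tgz` are hypothetical, and the SparkR install code itself is written in R.

```python
import os
import tarfile
import tempfile


def unpack_or_clean_up(archive_path, dest_dir):
    """Extract archive_path into dest_dir; on failure remove the archive and report it."""
    try:
        with tarfile.open(archive_path) as archive:
            archive.extractall(dest_dir)
        return True
    except (tarfile.TarError, OSError) as err:
        # A truncated or corrupt download should not be left behind to be
        # picked up again on the next attempt.
        if os.path.exists(archive_path):
            os.remove(archive_path)
        print(f"failed to unpack {archive_path}: {err}")
        return False


with tempfile.TemporaryDirectory() as work_dir:
    bad_archive = os.path.join(work_dir, "spark.tgz")
    with open(bad_archive, "wb") as f:
        f.write(b"not a tarball")  # simulate a broken download
    ok = unpack_or_clean_up(bad_archive, work_dir)
    leftover = os.path.exists(bad_archive)
    print(ok, leftover)
```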
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16589 from felixcheung/rtarreturncode.
## What changes were proposed in this pull request?
Refactored the scripts to remove duplication and give each script a clearer purpose
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16249 from felixcheung/rscripts.
## What changes were proposed in this pull request?
Windows seems to be the only place with appauthor in the path, for which we should say "Apache" (and case sensitive)
Current path of `AppData\Local\spark\spark\Cache` is a bit odd.
## How was this patch tested?
manual.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16590 from felixcheung/rcachedir.
## What changes were proposed in this pull request?
spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, which should be TRUE based on scala code.
In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
## How was this patch tested?
Manual tests by comparing with scala example.
Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16464 from wangmiao1981/new.
## What changes were proposed in this pull request?
To allow specifying number of partitions when the DataFrame is created
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16512 from felixcheung/rnumpart.
## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed and tol. As the Spark KMeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.
Add the missing parameters and corresponding documentation.
Modified existing unit tests to take additional parameters.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16523 from wangmiao1981/kmeans.
## What changes were proposed in this pull request?
```
df$foo <- 1
```
instead of
```
df$foo <- lit(1)
```
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16510 from felixcheung/rlitcol.
## What changes were proposed in this pull request?
R family is a longer list than what Spark supports.
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16511 from felixcheung/rdocglmfamily.
## What changes were proposed in this pull request?
- [X] Make sure all join types are clearly mentioned
- [X] Make join labeling/style consistent
- [X] Make join label ordering docs the same
- [X] Improve join documentation according to above for Scala
- [X] Improve join documentation according to above for Python
- [X] Improve join documentation according to above for R
## How was this patch tested?
No tests b/c docs.
Author: anabranch <wac.chambers@gmail.com>
Closes#16504 from anabranch/SPARK-19126.
## What changes were proposed in this pull request?
- [X] Fix inconsistencies in function reference for dense rank and rank
- [X] Make all languages equivalent in their reference to `dense_rank` and `rank`.
## How was this patch tested?
N/A for docs.
Author: anabranch <wac.chambers@gmail.com>
Closes#16505 from anabranch/SPARK-19127.
## What changes were proposed in this pull request?
SparkR ```mllib.R``` is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:
* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R
Note: Only reorg, no actual code change.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16312 from yanboliang/spark-18862.
## What changes were proposed in this pull request?
#16126 bumps master branch version to 2.2.0-SNAPSHOT, but it seems R version was omitted.
## How was this patch tested?
N/A
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16488 from yanboliang/r-version.
## What changes were proposed in this pull request?
It would make it easier to integrate with other components expecting a row-based JSON format.
This replaces the non-public toJSON RDD API.
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16368 from felixcheung/rJSON.
## What changes were proposed in this pull request?
API for SparkUI URL from SparkContext
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16367 from felixcheung/rwebui.
## What changes were proposed in this pull request?
The SparkR tests, `R/run-tests.sh`, succeed only once because `test_sparkSQL.R` does not clean up the test table, `people`.
As a result, rows in the `people` table accumulate on every run and the test cases fail.
The following is the failure result for the second run.
```r
Failed -------------------------------------------------------------------------
1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
Lengths differ: 2 vs 1
2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
Lengths differ: 2 vs 1
```
## How was this patch tested?
Manual. Run `run-tests.sh` twice and check if it passes without failures.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16310 from dongjoon-hyun/SPARK-18897.
## What changes were proposed in this pull request?
doc cleanup
## How was this patch tested?
~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
Output html here: https://felixcheung.github.io/sparkr-vignettes.html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16286 from felixcheung/rvignettespass.
## What changes were proposed in this pull request?
While doing the QA work, I found the following issues:
1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). the `spark.lda` documentation is missing default values for some parameters.
I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
## How was this patch tested?
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16284 from wangmiao1981/ks.
## What changes were proposed in this pull request?
Added short section for KSTest.
Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849)
![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)
## How was this patch tested?
Manually tested example locally.
Built vignettes locally.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#16283 from jkbradley/ksTest-vignette.
## What changes were proposed in this pull request?
While adding vignettes for kstest, I found some errors in the example:
1. There is a typo of kstest;
2. print.summary.KStest doesn't work with the example;
Fix the example errors;
Add a new unit test for print.summary.KStest;
## How was this patch tested?
Manual test;
Add new unit test;
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16259 from wangmiao1981/ks.
## What changes were proposed in this pull request?
Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc.
cc: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#16264 from mengxr/SPARK-18793.
## What changes were proposed in this pull request?
Support overriding the download url (including the version directory) with an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
## How was this patch tested?
unit test, manually testing
- snapshot build url
  - download when spark jar not cached
  - when spark jar is cached
- RC build url
  - download when spark jar not cached
  - when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell
To use this,
```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```
then in R,
```
library(SparkR) # or specify lib.loc
sparkR.session()
```
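The override logic can be pictured with a small Python sketch. This is only an analogy for the behavior described above, not the SparkR implementation: the function name `resolve_release_url` and the idea of passing in a precomputed default URL are assumptions; only the environment variable name comes from this PR.

```python
import os

# Name taken from the PR description; everything else here is hypothetical.
ENV_VAR = "SPARKR_RELEASE_DOWNLOAD_URL"


def resolve_release_url(default_url, environ=None):
    """Return the override URL if the environment variable is set and
    non-empty, otherwise fall back to default_url."""
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR, "").strip()
    return override if override else default_url
```

Passing `environ` explicitly keeps the sketch easy to test without modifying the real process environment.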
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16248 from felixcheung/rinstallurl.
## What changes were proposed in this pull request?
Several SparkR APIs that call into JVM methods with void return values print their `NULL` result, especially when running in a REPL or IDE.
example:
```
> setLogLevel("WARN")
NULL
```
We should fix this to make the result clearer.
Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it.
## How was this patch tested?
manually - I didn't find an expect_*() method in testthat for this
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16237 from felixcheung/rinvis.
## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:
returns summary information of the fitted model, which is a list. The list includes .......
Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not referenced here.
In the current documentation, some `return` entries end with `.` and some don't. `.` is added to the ones missing it.
Since spark.logit `summary` is undergoing a big refactoring, this PR doesn't cover it. It will be changed when the `spark.logit` PR is merged.
## How was this patch tested?
Manual build.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#16150 from wangmiao1981/audit2.
## What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have the help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
### more details
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN
(will skip tests here but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests)
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1)
(the output of this step is what we package into Spark dist and sparkr.zip)
Alternatively,
R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead.
But in any case, despite installing the package multiple times this is relatively fast.
Building vignettes takes a while though.
## How was this patch tested?
Manually, CI.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16014 from felixcheung/rdist.
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. It is used when making predictions and should be an argument of ```predict```; we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16169 from yanboliang/spark-18326.
## What changes were proposed in this pull request?
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l elements instead of k/l, which matters for small k.
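For reference, a minimal Python sketch of textbook reservoir sampling with the correct k/l replacement probability is shown here. It is not Spark's Scala implementation; the function name and the `rng` parameter are illustrative assumptions.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for l, item in enumerate(items, start=1):
        if l <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw uniformly from l slots (0 .. l-1), so the l-th item
            # replaces an entry with probability exactly k/l.
            # Drawing from only l-1 slots would give k/(l-1), the bias
            # described above.
            j = rng.randrange(l)
            if j < k:
                reservoir[j] = item
    return reservoir
```

With k = 1 and three input items, each item should be selected about one third of the time.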
## How was this patch tested?
Existing test plus new test case.
Author: Sean Owen <sowen@cloudera.com>
Closes#16129 from srowen/SPARK-18678.
## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will be changed in a later Spark version. Since that may introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
## How was this patch tested?
Unit tests.
The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16117 from yanboliang/spark-18686.
## What changes were proposed in this pull request?
If SparkR is running as a package and has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. With this bug, the automatic Spark install only works in the first session.
This seems to be a regression on the earlier behavior.
Fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc)
## How was this patch tested?
Manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16077 from felixcheung/rsessioninteractive.
## What changes were proposed in this pull request?
It would be better to fix this issue by providing a ```type``` option that lets users change the ```predict``` output schema, so they can output probabilities, log-space predictions, or original labels. To avoid a breaking API change in 2.1, this PR reverts the change for now; it will be added back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.
## How was this patch tested?
Existing unit tests.
This reverts commit daa975f4bf.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#16118 from yanboliang/spark-18291-revert.
## What changes were proposed in this pull request?
Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the indexed label.
In this PR, original label output is supported, and test cases are modified and added. The documentation is also updated.
## How was this patch tested?
Unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15910 from wangmiao1981/audit.
## What changes were proposed in this pull request?
### The Issue
If I specify my schema when doing
```scala
spark.read
.schema(someSchemaWherePartitionColumnsAreStrings)
```
but partition inference infers those columns as IntegerType (or, I assume, LongType or DoubleType; basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.
### Proposed solution
The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
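The resolution rule described above can be sketched in Python as follows. This is a simplified model, not the Scala code in this PR: schemas are plain lists of `(name, type)` pairs, the function name is hypothetical, and placing partition columns after the data columns is an assumption of the sketch.

```python
def merge_partition_columns(user_schema, inferred_partitions):
    """Combine a user-specified schema with inferred partition columns.

    user_schema and inferred_partitions are lists of (name, type) pairs.
    Partition columns are always included; the user's type wins when the
    user declared the column, otherwise the inferred type is used.
    """
    user_types = dict(user_schema)
    partition_names = {name for name, _ in inferred_partitions}
    # Data columns keep the order the user gave them.
    merged = [(n, t) for n, t in user_schema if n not in partition_names]
    for name, inferred_type in inferred_partitions:
        merged.append((name, user_types.get(name, inferred_type)))
    return merged
```

For example, a user schema declaring `part` as a string keeps that type even when inference would pick an integer, while an undeclared partition column is appended with its inferred type.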
## How was this patch tested?
Regression tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15951 from brkyvz/partition-corruption.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes#15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw an exception if users request ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15930 from yanboliang/spark-18501.
## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it downloads the Spark package from the Apache website, which is not necessary.
```
./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
```
The following is output:
```
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
......
```
There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running.
## How was this patch tested?
Offline test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15888 from yanboliang/spark-18444.
## What changes were proposed in this pull request?
I found the documentation for the sample method to be confusing; this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Closes#15815 from anabranch/SPARK-18365.
## What changes were proposed in this pull request?
```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for ```spark.mlp```.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15883 from yanboliang/spark-18438.
## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15851 from yanboliang/spark-18412.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the original label rather than the indexed label. This issue is very similar to [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15842 from yanboliang/spark-18401.
## What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15746 from felixcheung/rgbt.
## What changes were proposed in this pull request?
minor doc update that should go to master & branch-2.1
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15747 from felixcheung/pySPARK-14393.
## What changes were proposed in this pull request?
In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them.
## How was this patch tested?
Existing unit tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15697 from wangmiao1981/rtest.
## What changes were proposed in this pull request?
Due to a limitation of the Hive metastore (the table location must be a directory path, not a file path), we always store `path` for data source tables in storage properties, instead of the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like the way we store the table schema of data source tables in table properties.
This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.
This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.
For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15024 from cloud-fan/path.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
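The rewrite in items 2 to 4 can be modeled with a short Python sketch. It is a toy model of the idea, not Catalyst code: the `Column` class, the tuple representation of a named struct, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the NamePlaceHolder marker.
NAME_PLACEHOLDER = object()


@dataclass
class Column:
    name: str
    resolved: bool = True


def create_struct(children):
    """Build a named struct as a flat [name1, col1, name2, col2, ...] list.

    An unresolved column gets a placeholder instead of a name, because its
    name cannot be extracted yet.
    """
    flat = []
    for child in children:
        name = child.name if child.resolved else NAME_PLACEHOLDER
        flat.extend([name, child])
    return ("named_struct", flat)


def resolve_name_placeholders(struct):
    """Analyzer-style rule: replace placeholders once the column resolves."""
    tag, flat = struct
    out = list(flat)
    for i in range(0, len(out), 2):
        if out[i] is NAME_PLACEHOLDER and out[i + 1].resolved:
            out[i] = out[i + 1].name
    return (tag, out)
```

Because every struct is built through the named form, later rules only ever need to handle one shape of expression.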
## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, reran the entire analysis package successfully.
Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
Author: eyal farago <eyal farago>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: eyal farago <eyal.farago@gmail.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes#15718 from hvanhovell/SPARK-16839-2.
## What changes were proposed in this pull request?
This PR proposes to
- show R-friendly error messages rather than raw JVM exception ones.
As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same path with `write.df`, it seems safe to call `handledCallJMethod` to handle JVM messages.
- prevent the `zero-length variable name` error and print the ignored options as a warning message.
**Before**
``` r
> read.json("path", a = 1, 2, 3, "a")
Error in env[[name]] <- value :
zero-length variable name
```
``` r
> read.json("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.orc("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.text("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
> read.parquet("arbitrary_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
...
```
``` r
> write.json(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.orc(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.text(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
> write.parquet(df, "existing_path")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: path file:/... already exists.;
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
```
**After**
``` r
read.json("arbitrary_path", a = 1, 2, 3, "a")
Unnamed arguments ignored: 2, 3, a.
```
``` r
> read.json("arbitrary_path")
Error in json : analysis error - Path does not exist: file:/...
> read.orc("arbitrary_path")
Error in orc : analysis error - Path does not exist: file:/...
> read.text("arbitrary_path")
Error in text : analysis error - Path does not exist: file:/...
> read.parquet("arbitrary_path")
Error in parquet : analysis error - Path does not exist: file:/...
```
``` r
> write.json(df, "existing_path")
Error in json : analysis error - path file:/... already exists.;
> write.orc(df, "existing_path")
Error in orc : analysis error - path file:/... already exists.;
> write.text(df, "existing_path")
Error in text : analysis error - path file:/... already exists.;
> write.parquet(df, "existing_path")
Error in parquet : analysis error - path file:/... already exists.;
```
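The handling of unnamed arguments shown above can be approximated in Python, where positional arguments play the role of R's unnamed `...` entries. This is an analogy only; the function name `collect_named_options` is hypothetical and the real change lives in SparkR's R code.

```python
import warnings


def collect_named_options(*args, **kwargs):
    """Keep named options and warn about unnamed ones instead of failing."""
    if args:
        ignored = ", ".join(str(a) for a in args)
        warnings.warn(f"Unnamed arguments ignored: {ignored}.")
    return dict(kwargs)
```

Called as `collect_named_options(2, 3, "a", a=1)`, the sketch keeps only the option `a` and emits a warning listing the three ignored values.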
## How was this patch tested?
Unit tests in `test_utils.R` and `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15608 from HyukjinKwon/SPARK-17838.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
## How was this patch tested?
Ran all test suites in package org.apache.spark.sql, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, reran the entire analysis package successfully.
Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
Credit goes to hvanhovell for assisting with this PR.
Author: eyal farago <eyal farago>
Author: eyal farago <eyal.farago@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes#14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.