Commit graph

824 commits

Author SHA1 Message Date
Dominik Gehl 3a09024636 [SPARK-36154][DOCS] Documenting week and quarter as valid formats in pyspark sql/functions trunc
### What changes were proposed in this pull request?
Added missing documentation of week and quarter as valid formats to pyspark sql/functions trunc
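
For illustration, a minimal PySpark sketch of the two documented formats (the column name and input date are made up; assumes a `pyspark` session):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-07-15",)], ["d"])
df.select(
    F.trunc(F.to_date("d"), "week").alias("week"),        # Monday of that week
    F.trunc(F.to_date("d"), "quarter").alias("quarter"),  # first day of that quarter
).show()
# +----------+----------+
# |      week|   quarter|
# +----------+----------+
# |2021-07-12|2021-07-01|
# +----------+----------+
```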

### Why are the changes needed?
The PySpark and Scala documentation didn't mention the same supported formats.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33359 from dominikgehl/feature/SPARK-36154.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 802f632a28)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 16:51:25 +03:00
Wenchen Fan c1d8178817 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.

However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.

This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
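
A sketch of how a user might opt into the new behavior; the config keys below are my reading of what this PR introduces and should be treated as assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    # Assumed key: never coalesce partitions below this size.
    .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "1MB")
    # Assumed key: set to false to respect the advisory target size
    # instead of maximizing parallelism, as recommended above.
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    .getOrCreate()
)
```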

### Why are the changes needed?

AQE is on by default now; we should make the performance better in the default case.

### Does this PR introduce _any_ user-facing change?

Yes, new configs.

### How was this patch tested?

new tests

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c9c8ff569)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:46 +08:00
itholic 745756ca4c [SPARK-35603][R][DOCS] Add data source options link for R API documentation
### What changes were proposed in this pull request?

Options for each data source are documented on the Data Source Options page.

For Python, Scala, and Java, a link to the Data Source Options page was added to each API's documentation.

- Python
<img width="732" alt="Screen Shot 2021-06-07 at 12 25 45 PM" src="https://user-images.githubusercontent.com/44108233/120955187-cbe38800-c78b-11eb-9475-ccf89bbc3c95.png">

- Scala
<img width="677" alt="Screen Shot 2021-06-07 at 12 26 41 PM" src="https://user-images.githubusercontent.com/44108233/120955186-cab25b00-c78b-11eb-9fed-3f0d2024029b.png">

- JAVA
<img width="726" alt="Screen Shot 2021-06-07 at 12 27 49 PM" src="https://user-images.githubusercontent.com/44108233/120955182-c8e89780-c78b-11eb-9cf1-13e41ba35b3e.png">

However, we have no link for R documentation, so we should add the link to the R documentation as well.

### Why are the changes needed?

To show users the available options for each data source when they read/write it.

### Does this PR introduce _any_ user-facing change?

Yes, the link for Data Source Option is added to R documentation as below.

<img width="855" alt="Screen Shot 2021-06-07 at 12 29 26 PM" src="https://user-images.githubusercontent.com/44108233/120955302-064d2500-c78c-11eb-8dc3-cb22dfd5fd14.png">

### How was this patch tested?

Manually built doc and checked one by one

Closes #32797 from itholic/SPARK-35603.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 11:58:38 +09:00
Hyukjin Kwon 1ba1b70cfe [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+
### What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes them with three changes, as below:

- Avoid a sort on DataFrame which isn't legitimate: https://github.com/apache/spark/pull/32709#discussion_r642458108
- Treat the empty timezone and local timezone as equivalent in SparkR: https://github.com/apache/spark/pull/32709#discussion_r642464454
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is https://github.com/apache/spark/pull/32709#discussion_r642510089

### Why are the changes needed?

Higher R versions have bug fixes and improvements. More importantly, R users tend to use the latest R versions.

### Does this PR introduce _any_ user-facing change?

Yes, SparkR will work with R 4.1.0+.

### How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...

utils:
functions in utils.R: ..............................................
```

Closes #32709 from HyukjinKwon/SPARK-35573.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:35:52 +09:00
Felix Cheung 1530876615 [SPARK-35495][R] Change SparkR maintainer for CRAN
### What changes were proposed in this pull request?

As discussed, update SparkR maintainer for future release.

### Why are the changes needed?

Shivaram will not be able to work with this in the future, so we would like to migrate off the maintainer contact email.

shivaram

Closes #32642 from felixcheung/sparkr-maintainer.

Authored-by: Felix Cheung <felixcheung@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 19:08:54 -07:00
Hyukjin Kwon ecb48ccb7d [SPARK-35381][R] Fix lambda variable name issues in nested higher order functions at R APIs
### What changes were proposed in this pull request?

This PR fixes the same issue as https://github.com/apache/spark/pull/32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```
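
For reference, a hedged sketch of the analogous nested call in PySpark, whose counterpart fix landed in #32424 (assumes a `pyspark` shell where `spark` is predefined):

```python
import pyspark.sql.functions as F

df = spark.sql("SELECT array(1, 2, 3) AS numbers, array('a', 'b', 'c') AS letters")
df.select(
    F.transform(
        "numbers",
        lambda number: F.transform(
            "letters",
            # Each inner struct pairs the outer and inner lambda variables.
            lambda letter: F.struct(number.alias("n"), letter.alias("l")),
        ),
    )
).show(truncate=False)
```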

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 16:52:39 +09:00
Ruifeng Zheng 1f150b9392 [SPARK-35024][ML] Refactor LinearSVC - support virtual centering
### What changes were proposed in this pull request?
1. Remove the existing agg, and use a new agg supporting virtual centering
2. Add related test suites

### Why are the changes needed?
Centering vectors should accelerate convergence, and generate solutions closer to R's.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated existing test suites and added new ones.

Closes #32124 from zhengruifeng/svc_agg_refactor.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-04-25 13:16:46 +08:00
Kousuke Saruta c0972dec1d [SPARK-35180][BUILD] Allow to build SparkR with SBT
### What changes were proposed in this pull request?

This PR proposes a change that allows us to build SparkR with SBT.

### Why are the changes needed?

In the current master, SparkR can be built only with Maven.
It would be helpful if we could build it with SBT.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
```
build/sbt -Psparkr package
```

Closes #32285 from sarutak/sbt-sparkr.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-22 20:56:33 +09:00
Yuanjian Li 8e9e70045b [SPARK-35171][R] Declare the markdown package as a dependency of the SparkR package
### What changes were proposed in this pull request?
Declare the markdown package as a dependency of the SparkR package

### Why are the changes needed?
If pandoc is not installed locally, running make-distribution.sh fails with the following message:
```
— re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
Warning in engine$weave(file, quiet = quiet, encoding = enc) :
Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
The 'markdown' package should be declared as a dependency of the 'SparkR' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see https://github.com/yihui/knitr/issues/1864 for more information.
— failed re-building ‘sparkr-vignettes.Rmd’
```

### Does this PR introduce _any_ user-facing change?
Yes. Workaround for R packaging.

### How was this patch tested?
Manually tested. After the fix, the command `sh dev/make-distribution.sh -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` passes in an environment without pandoc.

Closes #32270 from xuanyuanking/SPARK-35171.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 20:43:47 +09:00
HyukjinKwon f72b9068ad [SPARK-34643][R][DOCS] Use CRAN URL in canonical form
### What changes were proposed in this pull request?

This PR fixes the URL links to use CRAN URL in canonical form.
CRAN package submission failed as below:

```
   Found the following (possibly) invalid URLs:
     URL: https://cran.r-project.org/web/packages/e1071/index.html
       From: man/spark.naiveBayes.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/mixtools/index.html
       From: man/spark.gaussianMixture.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/survival/index.html
       From: man/spark.survreg.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/topicmodels/index.html
       From: man/spark.lda.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     The canonical URL of the CRAN page for a package is
       https://CRAN.R-project.org/package=pkgname
```

### Why are the changes needed?

To fix CRAN package submission

### Does this PR introduce _any_ user-facing change?

It exposes the canonical form of the URLs to end users.

### How was this patch tested?

I manually clicked each link.

Closes #31759 from HyukjinKwon/minor-doc-fixes.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-05 10:08:11 -08:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - it only adds a new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`
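
A short PySpark sketch of some of the renamed functions (assumes a `pyspark` shell; the data is made up):

```python
import pyspark.sql.functions as F

df = spark.range(8)
df.select(
    F.count_distinct(df.id % 2),   # encouraged spelling of countDistinct
    F.sum_distinct(df.id % 2),     # replaces deprecated sumDistinct
    F.bitwise_not(F.lit(0)),       # replaces deprecated bitwiseNOT
    F.shiftleft(F.lit(1), 3),      # replaces deprecated shiftLeft
).show()
```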

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and adds new renamed APIs as described above.

### How was this patch tested?

Unit tests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
Max Gekk 2b76e6d15c [SPARK-34301][SQL] Use logical plan of alter table in CatalogImpl.recoverPartitions()
### What changes were proposed in this pull request?
Replace v1 exec node `AlterTableRecoverPartitionsCommand` by the logical node `AlterTableRecoverPartitions` in `CatalogImpl.recoverPartitions()`.

### Why are the changes needed?
1. Print user friendly error message for views:
```
my_temp_table is a temp view. 'recoverPartitions()' expects a table
```
Before the changes:
```
Table or view 'my_temp_table' not found in database 'default'
```

2. To not bind to v1 `ALTER TABLE .. RECOVER PARTITIONS`, and to support v2 tables potentially as well.
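
A minimal sketch of the user-facing call whose error message improves (the temp view name is an example):

```python
spark.range(1).createOrReplaceTempView("my_temp_table")
try:
    spark.catalog.recoverPartitions("my_temp_table")
except Exception as e:
    # Now explains that a temp view is not a table, instead of
    # "Table or view 'my_temp_table' not found in database 'default'".
    print(e)
```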

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running new test in `CatalogSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.internal.CatalogSuite"
```

Closes #31403 from MaxGekk/catalogimpl-recoverPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:09:40 +00:00
HyukjinKwon b5bdbf2ebc [SPARK-30682][R][SQL][FOLLOW-UP] Keep the name similar with Scala side in higher order functions
### What changes were proposed in this pull request?

This PR is a follow-up of #27433. It fixes the naming to match the Scala side, similar to https://github.com/apache/spark/pull/31062.

Note that:

- there is a bit of inconsistency already, e.g., `x`, `y` in SparkR, and they are documented together for doc deduplication. I did not change this part, but the `zero` vs `initialValue` naming difference looks unnecessary.
- such naming matching seems already pretty common in SparkR.

### Why are the changes needed?

To make the usage similar to the Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Also, I manually tested:

```r
> df <- select(createDataFrame(data.frame(id = 1)),expr("CAST(array(1.0, 2.0, -3.0, -4.0) AS array<double>) xs"))
> collect(select(df, array_aggregate("xs", initialValue = lit(0.0), merge = function(x, y) otherwise(when(x > y, x), y))))
  aggregate(xs, 0.0, lambdafunction(CASE WHEN (x > y) THEN x ELSE y END, x, y), lambdafunction(id, id))
1                                                                                                     2
```

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 14:19:14 +09:00
zero323 66cc12944a [SPARK-34132][DOCS][R] Update Roxygen version references to 7.1.1
### What changes were proposed in this pull request?

This PR updates `roxygen2` version reference in docs and `DESCRIPTION` file.

### Why are the changes needed?

According to information provided by shaneknapp (see [this comment](https://issues.apache.org/jira/browse/SPARK-30747?focusedCommentId=17265142&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17265142) to SPARK-30747) all workers use roxygen 7.1.1.

In the GitHub workflow, we install the latest version:

c75c29dcaa/.github/workflows/build_and_test.yml (L346)

which [is also 7.1.1 at the moment](https://web.archive.org/web/20210115172522/https://cran.r-project.org/web/packages/roxygen2/).

### Does this PR introduce _any_ user-facing change?

Docs and description mention the currently used package version.

### How was this patch tested?

- `dev/lint-r`.
- Manual check of command used in docs.

Closes #31200 from zero323/ROXYGEN-VERSION-UPDATE-DOCS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 17:08:17 -08:00
HyukjinKwon 0ba3ab4c23 [SPARK-34021][R] Fix hyper links in SparkR documentation for CRAN submission
### What changes were proposed in this pull request?

The 3.0.1 CRAN submission failed for the reason below:

```
   Found the following (possibly) invalid URLs:
     URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
       From: man/read.json.Rd
             man/write.json.Rd
       Status: 200
       Message: OK
     URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to
https://dl.acm.org/doi/10.1109/MC.2009.263)
       From: inst/doc/sparkr-vignettes.html
       Status: 200
       Message: OK
 ```

The links are being redirected now. This PR checked all hyperlinks in the docs, such as `href{...}` and `url{...}`, and fixed them all in SparkR:

- Fix two problems above.
- Fix http to https
- Fix `https://www.apache.org/ https://spark.apache.org/` -> `https://www.apache.org https://spark.apache.org`.

### Why are the changes needed?

For CRAN submission.

### Does this PR introduce _any_ user-facing change?

Virtually none, because it's just cleanup that CRAN requires.

### How was this patch tested?

Manually tested by clicking the links

Closes #31058 from HyukjinKwon/SPARK-34021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 13:58:13 +09:00
Tom.Howland 3d8ee492d6 [SPARK-34015][R] Fixing input timing in gapply
### What changes were proposed in this pull request?

When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it over-reports the time spent reading input rows.

In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.

For our application, here's what a log entry looks like before these changes were applied:

`20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s`

this indicates that we're spending more time reading rows than operating on the rows.

After these changes, it looks like this:

`20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s
`
### Why are the changes needed?

Metrics shouldn't mislead?

### Does this PR introduce _any_ user-facing change?

Aside from no longer misleading, no

### How was this patch tested?

Unit tests passed. Field test results seem plausible.

Closes #31021 from WamBamBoozle/input_timing.

Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 11:40:02 +09:00
Michael Chirico 12b69cc27c [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate
### What changes were proposed in this pull request?

Change the strategy for how the varargs are handled in the default `mutate` method

### Why are the changes needed?

Bugfix -- `deparse` + `sapply` not working as intended due to `width.cutoff`

### Does this PR introduce any user-facing change?

Yes, bugfix. Shouldn't change any working code.

### How was this patch tested?

None! yet.

Closes #28386 from MichaelChirico/r-mutate-deparse.

Lead-authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-17 17:20:45 +09:00
Dongjoon Hyun de9818f043 [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
zero323 5a1c5ac807 [SPARK-33622][R][ML] Add array_to_vector to SparkR
### What changes were proposed in this pull request?

This PR adds `array_to_vector` to R API.
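
For illustration, a sketch of the Python counterpart that the new R wrapper mirrors (assumes a `pyspark` shell):

```python
from pyspark.ml.functions import array_to_vector

df = spark.createDataFrame([([1.0, 2.0, 3.0],)], ["xs"])
df.select(array_to_vector("xs").alias("v")).show()
# +-------------+
# |            v|
# +-------------+
# |[1.0,2.0,3.0]|
# +-------------+
```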

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New function exposed in the public API.

### How was this patch tested?

New unit test.
Manual verification of the documentation examples.

Closes #30561 from zero323/SPARK-33622.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 10:44:14 -08:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mllib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
zero323 d082ad0abf [SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR
### What changes were proposed in this pull request?

This PR adds the following functions (introduced in Scala API with SPARK-33061):

- `acosh`
- `asinh`
- `atanh`

to Python and R.
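
A quick PySpark sketch of the same functions (assumes a `pyspark` shell; input values are made up):

```python
import pyspark.sql.functions as F

df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
# acosh needs x >= 1; atanh needs |x| < 1, so a literal is used for it.
df.select(F.acosh("x"), F.asinh("x"), F.atanh(F.lit(0.5))).show()
```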

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions.

### How was this patch tested?

New unit tests.

Closes #30501 from zero323/SPARK-33563.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-27 11:00:09 +09:00
zero323 56a8510e19 [SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR
### What changes were proposed in this pull request?

Adds `from_avro` and `to_avro` functions to SparkR.
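
For illustration, a sketch of the PySpark counterparts; it assumes the external `spark-avro` package is on the classpath, and the record schema below is a made-up example:

```python
import pyspark.sql.functions as F
from pyspark.sql.avro.functions import from_avro, to_avro

df = spark.range(3).select(F.struct("id").alias("s"))
# Serialize the struct column to Avro binary...
avro_df = df.select(to_avro("s").alias("avro"))

# ...and decode it back using a JSON-format Avro schema.
schema = '{"type":"record","name":"s","fields":[{"name":"id","type":"long"}]}'
avro_df.select(from_avro("avro", schema).alias("s")).show()
```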

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions exposed in SparkR API.

### How was this patch tested?

New unit tests.

Closes #30216 from zero323/SPARK-33304.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-19 09:52:29 +09:00
neko 4360c6f12a [SPARK-33363] Add prompt information related to the current task when pyspark/sparkR starts
### What changes were proposed in this pull request?
Add prompt information about the current applicationId, the web UI URL, and the master when pyspark/sparkR starts.

### Why are the changes needed?
The information printed when pyspark/sparkR starts does not include the basic information of the current application, which is inconvenient when using pyspark/sparkR from the Windows command prompt.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manual test result shows below:
![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)

Closes #30266 from akiyamaneko/pyspark-hint-info.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:12:19 +09:00
zero323 d71b2febaf [SPARK-30663][SPARK-33313][TESTS][R] Drop testthat 1.x support and add testthat 3.x support
### What changes were proposed in this pull request?

This PR modifies `R/pkg/tests/run-all.R` by:

- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.
- Adding `testthat` 3.x support to avoid AppVeyor failures.

### Why are the changes needed?

Currently used internal API has been removed in the latest `testthat` release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`

Closes #30219 from zero323/SPARK-33313.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-02 08:54:08 +09:00
Max Gekk b409025641 [SPARK-33281][SQL] Return SQL schema instead of Catalog string from the SchemaOfCsv expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.

### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.

### Does this PR introduce _any_ user-facing change?
Yes, it can, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the format of the schema shouldn't matter much.

Before:
```
> SELECT schema_of_csv('1,abc');
  struct<_c0:int,_c1:string>
```

After:
```
> SELECT schema_of_csv('1,abc');
  STRUCT<`_c0`: INT, `_c1`: STRING>
```

### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.

Closes #30180 from MaxGekk/schema_of_csv-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 21:02:10 +09:00
Max Gekk 9d5e48ea95 [SPARK-33270][SQL] Return SQL schema instead of Catalog string from the SchemaOfJson expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.

### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.

Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```

### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.

Before: `struct<col:bigint>`
After: ``STRUCT<`col`: BIGINT>``

### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Closes #30172 from MaxGekk/schema_of_json-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:30:41 +09:00
zero323 ea709d6748 [SPARK-33258][R][SQL] Add asc_nulls_* and desc_nulls_* methods to SparkR
### What changes were proposed in this pull request?

This PR adds the following `Column` methods to R API:

- asc_nulls_first
- asc_nulls_last
- desc_nulls_first
- desc_nulls_last
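
A sketch of the PySpark counterparts of these methods (assumes a `pyspark` shell; data is made up):

```python
import pyspark.sql.functions as F

df = spark.createDataFrame([(None,), (1,), (2,)], ["x"])
# Sort descending, but keep NULLs at the end instead of the default position.
df.orderBy(F.col("x").desc_nulls_last()).show()
# +----+
# |   x|
# +----+
# |   2|
# |   1|
# |null|
# +----+
```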

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new methods.

### How was this patch tested?

New unit tests.

Closes #30159 from zero323/SPARK-33258.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-28 09:46:13 +09:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer returns null, while SparkSQL returns Double.NaN, which caused the wrong result.

This PR adds an extra legacy config to fall back to the NaN behavior, and returns null by default to align with the TPCDS standard.
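
A sketch of the default behavior and the legacy flag (assumes a `pyspark` shell):

```python
# A single-element group: stddev_samp now returns NULL by default.
spark.sql("SELECT stddev_samp(x) FROM VALUES (1.0) AS t(x)").show()
# +--------------+
# |stddev_samp(x)|
# +--------------+
# |          null|
# +--------------+

# Restore the pre-3.1 Double.NaN behavior:
spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
```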

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce any user-facing change?
Yes. See the SQL migration guide:

In Spark 3.1, the statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr`, return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` is applied to a single-element set. In Spark version 3.0 and earlier, they return `Double.NaN` in such cases. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to the default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
zero323 3beab8d8a8 [SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR
### What changes were proposed in this pull request?

- Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Add `assert_true` and `raise_error`  to SparkR NAMESPACE.
- Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).

### Why are the changes needed?

As discussed in review for #29947.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Existing tests.
- Validation of annotations using MyPy

Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-09 09:50:45 +09:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
## What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
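
A hedged PySpark sketch of the two functions (assumes a `pyspark` shell; the messages are made up):

```python
import pyspark.sql.functions as F

df = spark.range(3)
# assert_true now takes an optional custom error message; returns NULL when
# the condition holds, fails the query with the message otherwise.
df.select(F.assert_true(F.col("id") < 10, "id must be < 10")).collect()

# raise_error fails the query unconditionally with the given message:
# df.select(F.raise_error("something went wrong")).collect()  # raises
```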

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
zero323 473b3ba6aa [SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark
### What changes were proposed in this pull request?

This PR adds `dropFields` method to:

- PySpark `Column`
- SparkR `Column`
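
A minimal PySpark sketch of `dropFields` (assumes a `pyspark` shell; the struct schema is a made-up example):

```python
import pyspark.sql.functions as F

df = spark.createDataFrame([((1, 2),)], "s struct<a:int,b:int>")
# Drop field "b" from the struct column; only "a" remains.
df.select(F.col("s").dropFields("b").alias("s")).printSchema()
```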

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new API.

### How was this patch tested?

- New unit tests.
- Manual verification of examples / doctests.
- Manual run of MyPy tests

Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 10:37:42 +09:00
zero323 24f890e8e8 [SPARK-33040][FOLLOW-UP][R] Reorder argument choices and add examples
### What changes were proposed in this pull request?

- Reorder choices of `dtype` to match Scala defaults.
- Add example to ml_functions.

### Why are the changes needed?

As requested:

- https://github.com/apache/spark/pull/29917#pullrequestreview-501715344
- https://github.com/apache/spark/pull/29917#pullrequestreview-501716521

### Does this PR introduce _any_ user-facing change?

No (changes to newly added component).

### How was this patch tested?

Existing tests.

Closes #29944 from zero323/SPARK-33040-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 16:31:17 +09:00
zero323 e83d03ca48 [SPARK-33040][R][ML] Add SparkR wrapper for vector_to_array
### What changes were proposed in this pull request?

Add SparkR wrapper for `o.a.s.ml.functions.vector_to_array`
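
For illustration, a sketch of the Python counterpart that the new R wrapper mirrors (assumes a `pyspark` shell):

```python
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),)], ["v"])
df.select(vector_to_array("v").alias("xs")).show()
# +----------+
# |        xs|
# +----------+
# |[1.0, 2.0]|
# +----------+
```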

### Why are the changes needed?

- Currently ML vectors, including predictions, are almost inaccessible to R users. That is a serious loss of functionality.
- Feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, new R function is added.

### How was this patch tested?

- New unit tests.
- Manual verification.

Closes #29917 from zero323/SPARK-33040.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 13:18:12 +09:00
zero323 9b21fdd731 [SPARK-32949][FOLLOW-UP][R][SQL] Reindent lines in SparkR timestamp_seconds
### What changes were proposed in this pull request?

Re-indent lines of SparkR `timestamp_seconds`.

### Why are the changes needed?

Current indentation is not aligned with the opening line.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29940 from zero323/SPARK-32949-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-03 13:50:38 -07:00
zero323 9b88aca295 [SPARK-33030][R] Add nth_value to SparkR
### What changes were proposed in this pull request?

Adds `nth_value` function to SparkR.
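
A PySpark sketch of the counterpart function (assumes a `pyspark` shell; data is made up):

```python
import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3)], ["g", "x"])
w = Window.partitionBy("g").orderBy("x")
# With the default running frame, the value appears from the 2nd row onward.
df.withColumn("second", F.nth_value("x", 2).over(w)).show()
# +---+---+------+
# |  g|  x|second|
# +---+---+------+
# |  a|  1|  null|
# |  a|  2|     2|
# |  a|  3|     2|
# +---+---+------+
```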

### Why are the changes needed?

Feature parity. The function has been already added to [Scala](https://issues.apache.org/jira/browse/SPARK-27951) and [Python](https://issues.apache.org/jira/browse/SPARK-33020).

### Does this PR introduce _any_ user-facing change?

Yes. New function is exposed to R users.

### How was this patch tested?

New unit tests.

Closes #29905 from zero323/SPARK-33030.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-02 00:53:17 -07:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
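
A small PySpark sketch of the documented behavior (assumes a `pyspark` shell):

```python
import pyspark.sql.functions as F

spark.range(1).select(
    F.current_timestamp().alias("t1"),
    F.current_timestamp().alias("t2"),
).show(truncate=False)
# t1 and t2 are always identical: both calls are folded to the same
# literal at the start of query evaluation.
```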

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
Max Gekk 7c14f177eb [SPARK-32306][SQL][DOCS] Clarify the result of percentile_approx()
### What changes were proposed in this pull request?
More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
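
A small sketch of the clarified behavior (assumes a `pyspark` shell; the values are made up):

```python
spark.sql(
    "SELECT percentile_approx(x, 0.5) FROM VALUES (1.0), (2.0), (10.0) AS t(x)"
).show()
# Returns 2.0: one of the actual input elements, not a value
# interpolated between them.
```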

### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #29835 from MaxGekk/doc-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-22 12:45:19 -07:00
zero323 3118c220f9 [SPARK-32949][R][SQL] Add timestamp_seconds to SparkR
### What changes were proposed in this pull request?

This PR adds an R wrapper for the `timestamp_seconds` function.
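
For illustration, the PySpark counterpart of the new R wrapper; a sketch assuming a `pyspark` shell:

```python
import pyspark.sql.functions as F

df = spark.createDataFrame([(1230219000,)], ["unix"])
# Converts seconds since the epoch to a timestamp; the displayed value
# depends on the session time zone (e.g., 2008-12-25 07:30:00 in
# America/Los_Angeles).
df.select(F.timestamp_seconds("unix").alias("ts")).show(truncate=False)
```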

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new R function.

### How was this patch tested?

New unit tests.

Closes #29822 from zero323/SPARK-32949.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-21 22:32:25 -07:00
zero323 1ad1f71535 [SPARK-32946][R][SQL] Add withField to SparkR
### What changes were proposed in this pull request?

This PR adds a `withField` method to SparkR `Column`.
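
For illustration, a sketch of the equivalent PySpark `Column.withField` (assumes a `pyspark` shell; the struct schema is made up):

```python
import pyspark.sql.functions as F

df = spark.createDataFrame([((1, 2),)], "s struct<a:int,b:int>")
# Add (or replace) field "c" inside the struct column.
df.select(F.col("s").withField("c", F.lit(3)).alias("s")).printSchema()
# The struct now has fields a, b, and c.
```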

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, a new method, equivalent to its Scala and PySpark counterparts, is exposed to the end user.

### How was this patch tested?

New unit tests added.

Closes #29814 from zero323/SPARK-32946.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 16:35:00 +09:00
zero323 7fb9f6884f [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName
### What changes were proposed in this pull request?

Add optional `allowMissingColumns` argument to SparkR `unionByName`.
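
The PySpark counterpart of the new argument, as a sketch (assumes a `pyspark` shell):

```python
df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3,)], ["a"])
# Missing columns are filled with NULL instead of raising an error.
df1.unionByName(df2, allowMissingColumns=True).show()
# +---+----+
# |  a|   b|
# +---+----+
# |  1|   2|
# |  3|null|
# +---+----+
```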

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

`unionByName` supports `allowMissingColumns`.

### How was this patch tested?

Existing unit tests. New unit tests targeting this feature.

Closes #29813 from zero323/SPARK-32799.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 09:39:34 +09:00
Lu WANG 701e593414 [MINOR][R] Fix a R style in try and finally at DataFrame.R
Fix an R style issue which is not caught by the R style checker. Got the error:
```
R/DataFrame.R:1244:17: style: Closing curly-braces should always be on their own line, unless it's followed by an else.
}, finally = {
 ^
lintr checks failed.
```

Closes #29574 from lu-wang-dl/fix-r-style.

Lead-authored-by: Lu WANG <lu.wang@databricks.com>
Co-authored-by: Lu Wang <38018689+lu-wang-dl@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 10:07:34 +09:00
HyukjinKwon 2491cf1ae1 [SPARK-32747][R][TESTS] Deduplicate configuration set/unset in test_sparkSQL_arrow.R
### What changes were proposed in this pull request?

This PR proposes to deduplicate configuration set/unset in `test_sparkSQL_arrow.R`.
Setting `spark.sql.execution.arrow.sparkr.enabled` can be done globally instead of in each test case.

### Why are the changes needed?

To deduplicate the code.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Manually ran the tests.

Closes #29592 from HyukjinKwon/SPARK-32747.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 17:39:12 +09:00
HyukjinKwon babb654c81 [SPARK-32647][INFRA] Report SparkR test results with JUnit reporter
### What changes were proposed in this pull request?

This PR proposes to generate JUnit XML test report in SparkR tests that can be leveraged in both Jenkins and GitHub Actions.

**GitHub Actions**

![Screen Shot 2020-08-18 at 12 42 46 PM](https://user-images.githubusercontent.com/6477701/90467934-55b85b00-e150-11ea-863c-c8415e764ddb.png)

**Jenkins**

![Screen Shot 2020-08-18 at 2 03 42 PM](https://user-images.githubusercontent.com/6477701/90472509-a5505400-e15b-11ea-9165-777ec9b96eaa.png)

NOTE that while I am here, I am switching the console reporter back from "progress" to "summary". Currently non-ASCII characters are broken in the Jenkins console, and switching to "summary" can work around that.
"summary" is the default format used in testthat 1.x.

### Why are the changes needed?

To check the test failures more easily.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

It is tested in GitHub Actions at https://github.com/HyukjinKwon/spark/pull/23/checks?check_run_id=996586446
In case of Jenkins, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127525/testReport/

Closes #29456 from HyukjinKwon/sparkr-junit.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-18 19:35:15 +09:00
alexander-daskalov 10edeafc69 [MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs concerning `approx_count_distinct`, I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to _**maximum relative standard deviation allowed**_.

### Why are the changes needed?

"Maximum estimation error allowed" can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on any given run the estimation error can still be above the rsd parameter.
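
A sketch illustrating the parameter (assumes a `pyspark` shell; the rsd value is an example):

```python
import pyspark.sql.functions as F

# rsd targets the relative standard deviation of the estimator;
# individual runs may still deviate by more than rsd.
spark.range(100000).select(
    F.approx_count_distinct("id", rsd=0.05).alias("approx_distinct")
).show()
```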

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-14 22:10:41 +09:00
Dongjoon Hyun b421bf0196 [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
### What changes were proposed in this pull request?

This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`.

### Why are the changes needed?

In a YARN cluster, HDFS usually provides storage with a replication factor of 3, so we can technically save the result to HDFS to get `StorageLevel.DISK_ONLY_3`. However, disaggregated clusters or clusters without storage services are on the rise. Previously, in that situation, users could use the similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX.
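
A minimal sketch of the new level from PySpark (assumes a `pyspark` shell; availability of the constant in the Python API is my assumption based on this change):

```python
from pyspark import StorageLevel

df = spark.range(10 ** 6)
# Persist with three on-disk replicas, without relying on HDFS replication.
df.persist(StorageLevel.DISK_ONLY_3)
df.count()  # materializes the persisted data
```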

### Does this PR introduce _any_ user-facing change?

Yes. This provides a new built-in option.

### How was this patch tested?

Pass the GitHub Action or Jenkins with the revised test cases.

Closes #29331 from dongjoon-hyun/SPARK-32517.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-10 07:33:06 -07:00
HyukjinKwon 42219af906 [SPARK-32543][R] Remove arrow::as_tibble usage in SparkR
### What changes were proposed in this pull request?

SparkR increased the minimal version of Arrow R version to 1.0.0 at SPARK-32452, and Arrow R 0.14 dropped `as_tibble`. We can remove the usage in SparkR.

### Why are the changes needed?

To remove code that is no longer used.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GitHub Actions will test them out.

Closes #29361 from HyukjinKwon/SPARK-32543.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-05 10:35:03 -07:00
HyukjinKwon e1d7321034 [SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization
### What changes were proposed in this pull request?

This PR proposes to:

1. Fix the error message when the output schema is mismatched with the R DataFrame returned by the given function. For example,

    ```R
    df <- createDataFrame(list(list(a=1L, b="2")))
    count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
    ```

    **Before:**

    ```
    Error in handleErrors(returnStatus, conn) :
      ...
      java.lang.UnsupportedOperationException
	    ...
    ```

    **After:**

    ```
    Error in handleErrors(returnStatus, conn) :
     ...
     java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
        ...
    ```

2. Update documentation about the schema matching for `gapply` and `dapply`.

### Why are the changes needed?

To show which schema is not matched, and let users know what's going on.

### Does this PR introduce _any_ user-facing change?

Yes, error message is updated as above, and documentation is updated.

### How was this patch tested?

Manually tested, and unit tests were added.

Closes #29283 from HyukjinKwon/r-vectorized-error.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 15:16:02 +09:00
HyukjinKwon bfa5d57bbd [SPARK-32452][R][SQL] Bump up the minimum Arrow version as 1.0.0 in SparkR
### What changes were proposed in this pull request?

This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date.

Other required changes to support 1.0.0 were already made in SPARK-32451.

### Why are the changes needed?

On the R side, users are rather aggressively encouraged to use the latest version, and SparkR vectorization is a very experimental feature added in Spark 3.0.

Also, we're technically not testing old Arrow versions in SparkR for now.

### Does this PR introduce _any_ user-facing change?

Yes, users wouldn't be able to use SparkR with old Arrow.

### How was this patch tested?

GitHub Actions and AppVeyor are already testing them.

Closes #29253 from HyukjinKwon/SPARK-32452.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 14:21:15 +09:00
Dongjoon Hyun 8153f56286 [SPARK-32451][R] Support Apache Arrow 1.0.0
### What changes were proposed in this pull request?

Currently, GitHub Actions is broken due to SparkR unit test failures caused by the new Apache Arrow 1.0.0.

![Screen Shot 2020-07-26 at 5 12 08 PM](https://user-images.githubusercontent.com/9700541/88492923-3409f080-cf63-11ea-8fea-6051298c2dd0.png)

This PR aims to update R code according to Apache Arrow 1.0.0 recommendation to pass R unit tests.

An alternative is pinning the Apache Arrow version at 0.17.1; I also created a PR with that approach to compare against this one:
- https://github.com/apache/spark/pull/29251

### Why are the changes needed?

- Apache Spark 3.1 supports Apache Arrow 0.15.1+.
- Apache Arrow 1.0.0 was released a few days ago, and it causes GitHub Actions SparkR test failures due to warnings.
    - https://github.com/apache/spark/commits/master

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Pass the Jenkins (https://github.com/apache/spark/pull/29252#issuecomment-664067492)
- [x] Pass the GitHub (https://github.com/apache/spark/runs/912656867)

Closes #29252 from dongjoon-hyun/SPARK-ARROW.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-26 18:51:25 -07:00