Commit graph

824 commits

Author SHA1 Message Date
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Huaxin Gao d1255297b8 [SPARK-19939][ML] Add support for association rules in ML
### What changes were proposed in this pull request?
Adding support to Association Rules in Spark ml.fpm.

### Why are the changes needed?
Support is an indication of how frequently the itemset of an association rule appears in the database and suggests if the rules are generally applicable to the dateset. Refer to [wiki](https://en.wikipedia.org/wiki/Association_rule_learning#Support) for more details.

### Does this PR introduce _any_ user-facing change?
Yes. Associate Rules now have support measure

### How was this patch tested?
existing and new unit test

Closes #28903 from huaxingao/fpm.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-26 12:55:38 -05:00
HyukjinKwon e29ec42879 [SPARK-32074][BUILD][R] Update AppVeyor R version to 4.0.2
### What changes were proposed in this pull request?
R version 4.0.2 was released, see https://cran.r-project.org/doc/manuals/r-release/NEWS.html. This PR targets to upgrade R version in AppVeyor CI environment.

### Why are the changes needed?

To test the latest R versions before the release, and see if there are any regressions.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

AppVeyor will test.

Closes #28909 from HyukjinKwon/SPARK-32074.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-24 15:37:41 +09:00
HyukjinKwon b62e2536db [SPARK-32073][R] Drop R < 3.5 support
### What changes were proposed in this pull request?

Spark 3.0 accidentally dropped R < 3.5. It is built by R 3.6.3 which not support R < 3.5:

```
Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; need R 3.5.0 or newer version.
```

In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 4.0.0. This is inevitable to release on CRAN because they require to make the tests pass with the latest R.

### Why are the changes needed?

To show the supported versions correctly, and support R 4.0.0 to unblock the releases.

### Does this PR introduce _any_ user-facing change?

In fact, no because Spark 3.0.0 already does not work with R < 3.5.
Compared to Spark 2.4, yes. R < 3.5 would not work.

### How was this patch tested?

Jenkins should test it out.

Closes #28908 from HyukjinKwon/SPARK-32073.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-24 11:05:27 +09:00
HyukjinKwon 11d2b07b74 [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+
### What changes were proposed in this pull request?

This PR proposes to ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+.

Currently, when you run the codes that runs R native codes, it fails as below with R 4.0.0:

```r
df <- createDataFrame(lapply(seq(100), function (e) list(value=e)))
count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df)))
```

```
org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue
```

The root cause seems to be related to when an S4 generic method is manually included into the closure's environment via `SparkR:::cleanClosure`. For example, when an RRDD is created via `createDataFrame` with calling `lapply` to convert, `lapply` itself:

f53d8c63e8/R/pkg/R/RDD.R (L484)

is added into the environment of the cleaned closure - because this is not an exposed namespace; however, this is broken in R 4.0.0+ for an unknown reason with an error message such as "attempt to bind a variable to R_UnboundValue".

Actually, we don't need to add the `lapply` into the environment of the closure because it is not supposed to be called in worker side. In fact, there is no private generic methods supposed to be called in worker side in SparkR at all from my understanding.

Therefore, this PR takes a simpler path to work around just by explicitly excluding the S4 generic methods under SparkR namespace to support R 4.0.0. in SparkR.

### Why are the changes needed?

To support R 4.0.0+ with SparkR, and unblock the releases on CRAN. CRAN requires the tests pass with the latest R.

### Does this PR introduce _any_ user-facing change?

Yes, it will support R 4.0.0 to end-users.

### How was this patch tested?

Manually tested. Both CRAN and tests with R 4.0.1:

```
══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]
✔ |  OK F W S | Context
✔ |  11       | binary functions [2.5 s]
✔ |   4       | functions on binary files [2.1 s]
✔ |   2       | broadcast variables [0.5 s]
✔ |   5       | functions in client.R
✔ |  46       | test functions in sparkR.R [6.3 s]
✔ |   2       | include R packages [0.3 s]
✔ |   2       | JVM API [0.2 s]
✔ |  75       | MLlib classification algorithms, except for tree-based algorithms [86.3 s]
✔ |  70       | MLlib clustering algorithms [44.5 s]
✔ |   6       | MLlib frequent pattern mining [3.0 s]
✔ |   8       | MLlib recommendation algorithms [9.6 s]
✔ | 136       | MLlib regression algorithms, except for tree-based algorithms [76.0 s]
✔ |   8       | MLlib statistics algorithms [0.6 s]
✔ |  94       | MLlib tree-based algorithms [85.2 s]
✔ |  29       | parallelize() and collect() [0.5 s]
✔ | 428       | basic RDD functions [25.3 s]
✔ |  39       | SerDe functionality [2.2 s]
✔ |  20       | partitionBy, groupByKey, reduceByKey etc. [3.9 s]
✔ |   4       | functions in sparkR.R
✔ |  16       | SparkSQL Arrow optimization [19.2 s]
✔ |   6       | test show SparkDataFrame when eager execution is enabled. [1.1 s]
✔ | 1175       | SparkSQL functions [134.8 s]
✔ |  42       | Structured Streaming [478.2 s]
✔ |  16       | tests RDD function take() [1.1 s]
✔ |  14       | the textFile() function [2.9 s]
✔ |  46       | functions in utils.R [0.7 s]
✔ |   0     1 | Windows-specific tests
────────────────────────────────────────────────────────────────────────────────
test_Windows.R:22: skip: sparkJars tag in SparkContext
Reason: This test is only for Windows, skipped
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 987.3 s

OK:       2304
Failed:   0
Warnings: 0
Skipped:  1
...
Status: OK
+ popd
Tests passed.
```

Note that I tested to build SparkR in R 4.0.0, and run the tests with R 3.6.3. It all passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837).

Closes #28907 from HyukjinKwon/SPARK-31918.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-24 11:03:05 +09:00
HyukjinKwon e1315cd656
[SPARK-31701][R][SQL] Bump up the minimum Arrow version as 0.15.1 in SparkR
### What changes were proposed in this pull request?

This PR proposes to set the minimum Arrow version as 0.15.1 to be consistent with PySpark side at.

### Why are the changes needed?

It will reduce the maintenance overhead to match the Arrow versions, and minimize the supported range. SparkR Arrow optimization is experimental yet.

### Does this PR introduce _any_ user-facing change?

No, it's the change in unreleased branches only.

### How was this patch tested?

0.15.x was already tested at SPARK-29378, and we're testing the latest version of SparkR currently in AppVeyor. I already manually tested too.

Closes #28520 from HyukjinKwon/SPARK-31701.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 10:03:12 -07:00
Michael Chirico f53d8c63e8 [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical
### What changes were proposed in this pull request?

Internal usages like `{stop,warning,message}({paste,paste0,sprintf}` and `{stop,warning,message}(some_literal_string_as_variable` have been removed and replaced as appropriate.

### Why are the changes needed?

CRAN policy recommends against using such constructions to build error messages, in particular because it makes the process of creating portable error messages for the package more onerous.

### Does this PR introduce any user-facing change?

There may be some small grammatical changes visible in error messaging.

### How was this patch tested?

Not done

Closes #28365 from MichaelChirico/r-stop-paste.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-03 12:40:20 +09:00
Michael Chirico c00fe5ef3e
[MINOR][R] small tidying of sh scripts for R
### What changes were proposed in this pull request?

Some tidying of `sh` scripts in `R/`

### Why are the changes needed?

Not strictly needed, but the `'devtools' %in% installed.packages()` line in particular is "improper" / proabbly slow

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not

Closes #28419 from MichaelChirico/r-scripts-cleanup.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-30 16:58:05 -07:00
Michael Chirico 226301a6bc [SPARK-29339][R][FOLLOW-UP] Remove requireNamespace1 workaround for arrow
### What changes were proposed in this pull request?

`requireNamespace1` was used to get `SparkR` on CRAN while Suggesting `arrow` while `arrow` was not yet available on CRAN.

### Why are the changes needed?

Now `arrow` is on CRAN, we can properly use `requireNamespace` without triggering CRAN failures.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

AppVeyor will test, and CRAN check in Jenkins build.

Closes #28387 from MichaelChirico/r-require-arrow.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-29 18:42:32 +09:00
Michael Chirico c011502ee3 [SPARK-31573][R] Apply fixed=TRUE as appropriate to regex usage in R
### What changes were proposed in this pull request?

For regex functions in base R (`gsub`, `grep`, `grepl`, `strsplit`, `gregexpr`), supplying the `fixed=TRUE` option will be more performant.

### Why are the changes needed?

This is a minor fix for performance

### Does this PR introduce any user-facing change?

No (although some internal code was applying fixed-as-regex in some cases that could technically have been over-broad and caught unintended patterns)

### How was this patch tested?

Not

Closes #28367 from MichaelChirico/r-regex-fixed.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 17:24:21 +09:00
Michael Chirico 410fa91321 [SPARK-31578][R] Vectorize schema validation for arrow in types.R
### What changes were proposed in this pull request?

Repeated `sapply` avoided in internal `checkSchemaInArrow`

### Why are the changes needed?

Current implementation is doubly inefficient:

 1. Repeatedly doing the same (95%) `sapply` loop
 2. Doing scalar `==` on a vector (`==` should be done over the whole vector for efficiency)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By my trusty friend the CI bots

Closes #28372 from MichaelChirico/vectorize-types.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 11:03:51 +09:00
Michael Chirico a68d98cf4f [SPARK-31568][R] Replaces paste(sep="") to paste0
### What changes were proposed in this pull request?

All instances of `paste(..., sep = "")` in the code are replaced with `paste0` which is more performant

### Why are the changes needed?

Performance & consistency (`paste0` is already used extensively in the R package)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

None

Closes #28374 from MichaelChirico/r-paste0.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 10:58:48 +09:00
Michael Chirico b6509aa502 [SPARK-31568][R][DOCS] Add detail about func/key in gapply to documentation
### What changes were proposed in this pull request?

Improve documentation for `gapply` in `SparkR`

### Why are the changes needed?

Spent a long time this weekend trying to figure out just what exactly `key` is in `gapply`'s `func`. I had assumed it would be a _named_ list, but apparently not -- the examples are working because `schema` is applying the name and the names of the output `data.frame` don't matter.

As near as I can tell the description I've added is correct, namely, that `key` is an unnamed list.

### Does this PR introduce any user-facing change?

No? Not in code. Only documentation.

### How was this patch tested?

Not. Documentation only

Closes #28350 from MichaelChirico/r-gapply-key-doc.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-27 17:02:18 +09:00
HyukjinKwon da3c6c4e35 [SPARK-31510][R][BUILD] Set setwd in R documentation build
Seems like in certain environment, it requires to set `setwd` as below:

```
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
Loading required package: usethis
Error: Could not find package root, is your working directory inside a package?
```

see also https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root and https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg

We can make up another story too. For example, if you set a specific directory in your `~/.Rprofile`, then R documentation build will fail as below:

```
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

This PR proposes to set the `setwd` explicitly so it does not get affected on the global environment.

To make R dev env more independent.

No, dev only.

Manually tested:

```bash
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

Before:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

After:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Updating SparkR documentation
Loading SparkR
Creating a new generic function for ‘as.data.frame’ in package ‘SparkR’
Creating a new generic function for ‘colnames’ in package ‘SparkR’
Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
Creating a new generic function for ‘cov’ in package ‘SparkR’
Creating a new generic function for ‘drop’ in package ‘SparkR’
Creating a new generic function for ‘na.omit’ in package ‘SparkR’
Creating a new generic function for ‘filter’ in package ‘SparkR’
Creating a new generic function for ‘intersect’ in package ‘SparkR’
...
```

Closes #28285
2020-04-23 10:23:01 +09:00
zero323 697fe911ac [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMRegressor`:

- Supporting ` org.apache.spark.ml.r.FMRegressorWrapper`.
- `FMRegressionModel` S4 class.
- Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27571 from zero323/SPARK-30819.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-09 19:38:11 -05:00
zero323 0063462d55 [SPARK-30818][SPARKR][ML] Add SparkR LinearRegression wrapper
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `LinearRegression`

- Supporting `org.apache.spark.ml.rLinearRegressionWrapper`.
- `LinearRegressionModel` S4 class.
- Corresponding `spark.lm` predict, summary and write.ml generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27593 from zero323/SPARK-30818.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-08 22:29:44 -05:00
zero323 0d37f794ef [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
### What changes were proposed in this pull request?

This pull request adds SparkR wrapper for `FMClassifier`:

- Supporting ` org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.

### Why are the changes needed?

Feature parity.

### Does this PR introduce any user-facing change?

No (new API).

### How was this patch tested?

New unit tests.

Closes #27570 from zero323/SPARK-30820.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-07 09:01:45 -05:00
Huaxin Gao fd0b228127 [SPARK-31290][R] Add back the deprecated R APIs
### What changes were proposed in this pull request?
Add back the deprecated R APIs removed by https://github.com/apache/spark/pull/22843/ and https://github.com/apache/spark/pull/22815.

These APIs are

- `sparkR.init`
- `sparkRSQL.init`
- `sparkRHive.init`
- `registerTempTable`
- `createExternalTable`
- `dropTempTable`

No need to port the function such as
```r
createExternalTable <- function(x, ...) {
  dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
}
```
because this was for the backward compatibility when SQLContext exists before assuming from https://github.com/apache/spark/pull/9192,  but seems we don't need it anymore since SparkR replaced SQLContext with Spark Session at https://github.com/apache/spark/pull/13635.

### Why are the changes needed?
Amend Spark's Semantic Versioning Policy

### Does this PR introduce any user-facing change?
Yes
The removed R APIs are put back.

### How was this patch tested?
Add back the removed tests

Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-01 10:38:03 +09:00
Ben Ryves fa37856710 [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
### What changes were proposed in this pull request?
A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

### Why are the changes needed?
`rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](a1dbcd13a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala (L71))). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0).

### Does this PR introduce any user-facing change?
Only documentation changes.

### How was this patch tested?
Documentation changes only.

Closes #28071 from Smeb/master.

Authored-by: Ben Ryves <benjamin.ryves@getyourguide.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 15:16:17 +09:00
Maxim Gekk d2ff5c5bfb [SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
### What changes were proposed in this pull request?
In the PR, I propose to update the doc for the `timeZone` option in JSON/CSV datasources and for the `tz` parameter of the `from_utc_timestamp()`/`to_utc_timestamp()` functions, and to restrict format of config's values to 2 forms:
1. Geographical regions, such as `America/Los_Angeles`.
2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`.

### Why are the changes needed?
Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`, and manual testing.

Closes #28051 from MaxGekk/doc-time-zone-option.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-30 12:20:11 +08:00
Kent Yao f81f11822c [SPARK-31189][R][DOCS][FOLLOWUP] Replace Datetime pattern links in R doc
### What changes were proposed in this pull request?

Use our own docs for data pattern instructions to replace java doc.

### Why are the changes needed?

fix doc

### Does this PR introduce any user-facing change?

yes. doc changed
### How was this patch tested?

pass jenkins

Closes #27975 from yaooqinn/SPARK-31189-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-22 14:22:44 +09:00
zero323 01f20394ac [SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions
### What changes were proposed in this pull request?

- Adds following overloaded variants to Scala `o.a.s.sql.functions`:

  - `percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column`
  - `percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column`
  - `percentile_approx(e: Column, percentage: Double, accuracy: Long): Column`
  - `percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column`
  - `percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column` (primarily for
Python interop).
  - `percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column`

- Adds `percentile_approx` to `pyspark.sql.functions`.

- Adds `percentile_approx` function to SparkR.

### Why are the changes needed?

Currently we support `percentile_approx` only in SQL expression. It is inconvenient and makes this function relatively unknown.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit tests for SparkR an PySpark.

As for now there are no additional tests in Scala API ‒ `ApproximatePercentile` is well tested and Python (including docstrings) and R tests provide additional tests, so it seems unnecessary.

Closes #27278 from zero323/SPARK-30569.

Lead-authored-by: zero323 <mszymkiewicz@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-17 10:44:21 +09:00
Yuanjian Li 3493162c78 [SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime
### What changes were proposed in this pull request?
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology ). The switching job is completed in SPARK-26651.
But after the switching, there are some patterns not compatible between Java 8 and Java 7, Spark needs its own definition on the patterns rather than depends on Java API.
In this PR, we achieve this by writing the document and shadow the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030)

### Why are the changes needed?
For backward compatibility.

### Does this PR introduce any user-facing change?
No.
After we define our own datetime parsing and formatting patterns, it's same to old Spark version.

### How was this patch tested?
Existing and new added UT.
Locally document test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 14:11:13 +08:00
zero323 c467961e8a [SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions
### What changes were proposed in this pull request?

This PR add R API for invoking following higher functions:

- `transform` -> `array_transform` (to avoid conflict with `base::transform`).
- `exists` -> `array_exists` (to avoid conflict with `base::exists`).
- `forall` -> `array_forall` (no conflicts, renamed for consistency)
- `filter` -> `array_filter` (to avoid conflict with `stats::filter`).
- `aggregate` -> `array_aggregate` (to avoid conflict with `stats::transform`).
- `zip_with` -> `arrays_zip_with` (no conflicts, renamed for consistency)
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

Overall implementation follows the same pattern as proposed for PySpark (#27406) and reuses object supporting Scala implementation (SPARK-27297).

### Why are the changes needed?

Currently higher order functions are available only using SQL and Scala API and can use only SQL expressions:

```r
select(df, expr("transform(xs, x -> x + 1)")
```

This is error-prone, and hard to do right, when complex logic is used (`when` / `otherwise`, complex objects).

If this PR is accepted, above function could be simply rewritten as:

```r
select(df, transform("xs", function(x) x + 1))
```

### Does this PR introduce any user-facing change?

No (but new user-facing functions are added).

### How was this patch tested?

Added new unit tests.

Closes #27433 from zero323/SPARK-30682.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:58:56 +09:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
HyukjinKwon b95ccb1d8b [SPARK-30737][SPARK-27262][R][BUILD] Reenable CRAN check with UTF-8 encoding to DESCRIPTION
### What changes were proposed in this pull request?

This PR proposes to reenable CRAN check disabled at https://github.com/apache/spark/pull/27460. Given the tests https://github.com/apache/spark/pull/27468, seems we should also port https://github.com/apache/spark/pull/23823 together.

### Why are the changes needed?
To check CRAN back.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
It was tested at https://github.com/apache/spark/pull/27468 and Jenkins should test it out.

Closes #27472 from HyukjinKwon/SPARK-30737.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-06 13:01:08 +09:00
HyukjinKwon e2d984aa1c [SPARK-30733][R][HOTFIX] Fix SparkR tests per testthat and R version upgrade, and disable CRAN
### What changes were proposed in this pull request?

There are currently the R test failures after upgrading `testthat` to 2.0.0, and R version 3.5.2 as of SPARK-23435. This PR targets to fix the tests and make the tests pass. See the explanations and causes below:

```
test_context.R:49: failure: Check masked functions
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 6 - 4 == 2

test_context.R:53: failure: Check masked functions
sort(maskedCompletely, na.last = TRUE) not equal to sort(namesOfMaskedCompletely, na.last = TRUE).
5/6 mismatches
x[2]: "endsWith"
y[2]: "filter"

x[3]: "filter"
y[3]: "not"

x[4]: "not"
y[4]: "sample"

x[5]: "sample"
y[5]: NA

x[6]: "startsWith"
y[6]: NA
```

From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.

```
test_includePackage.R:31: error: include inside function
package or namespace load failed for ���plyr���:
 package ���plyr��� was installed by an R version with different internals; it needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.
```

From my cursory look, previously installed `plyr` remains and it's not compatible with the new R version. I fixed accordingly and Jenkins will test it out.

```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
Your system is mis-configured: ���/etc/localtime��� is not a symlink
```

Seems a env problem. I suppressed the warnings for now.

```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
It is strongly recommended to set envionment variable TZ to ���America/Los_Angeles��� (or equivalent)
```

Seems a env problem. I suppressed the warnings for now.

```
test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function ���startsWith��� for signature ���"character"���
1: expect_true(startsWith("Hello World", "Hello")) at /home/jenkins/workspace/SparkPullRequestBuilder2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable)
   {
       methods <- .findInheritedMethods(classes, fdef, mtable)
       if (length(methods) == 1L)
           return(methods[[1L]])
       else if (length(methods) == 0L) {
           cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", collapse = ", ")
           stop(gettextf("unable to find an inherited method for function %s for signature %s",
               sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
       }
       else stop("Internal error in finding inherited methods; didn't return a unique method",
           domain = NA)
   })(list("character"), new("nonstandardGenericFunction", .Data = function (x, prefix)
   {
       standardGeneric("startsWith")
   }, generic = structure("startsWith", package = "SparkR"), package = "SparkR", group = list(),
       valueClass = character(0), signature = c("x", "prefix"), default = NULL, skeleton = (function (x,
           prefix)
       stop("invalid call in method dispatch to 'startsWith' (no default method)", domain = NA))(x,
           prefix)), <environment>)
6: stop(gettextf("unable to find an inherited method for function %s for signature %s",
       sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
```

From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.

Also, this PR causes a CRAN check failure as below:

```
* creating vignettes ... ERROR
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
package ���htmltools��� was installed by an R version with different internals; it needs to be reinstalled for use with this R version
```

This PR disables it for now.

### Why are the changes needed?

To unblock other PRs.

### Does this PR introduce any user-facing change?

No. Test only and dev only.

### How was this patch tested?

No. I am going to use Jenkins to test.

Closes #27460 from HyukjinKwon/r-test-failure.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-05 16:45:54 +09:00
zero323 fb321b666d [MINOR][SPARKR][DOCS] Remove duplicate @name tags from read.df and read.stream
### What changes were proposed in this pull request?

Remove duplicate `name` tags from `read.df` and `read.stream`.

### Why are the changes needed?

These tags are already present in

1adf3520e3/R/pkg/R/SQLContext.R (L546)

and

1adf3520e3/R/pkg/R/SQLContext.R (L678)

for `read.df` and `read.stream` respectively.

As only one `name` tag per block is allowed, this causes build warnings with recent `roxygen2` versions:

```
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:559] name May only use one name per block
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:690] name May only use one name per block
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27437 from zero323/roxygen-warnings-names.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-03 11:09:25 +09:00
zero323 298d0a5102 [SPARK-23435][SPARKR][TESTS] Update testthat to >= 2.0.0
### What changes were proposed in this pull request?

- Update `testthat` to >= 2.0.0
- Replace of `testthat:::run_tests` with `testthat:::test_package_dir`
- Add trivial assertions for tests, without any expectations, to avoid skipping.
- Update related docs.

### Why are the changes needed?

`testthat` version has been frozen by [SPARK-22817](https://issues.apache.org/jira/browse/SPARK-22817) / https://github.com/apache/spark/pull/20003, but 1.0.2 is pretty old, and we shouldn't keep things in this state forever.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

- Existing CI pipeline:
     - Windows build on AppVeyor, R 3.6.2, testthtat 2.3.1
     - Linux build on Jenkins, R 3.1.x, testthat 1.0.2

- Additional builds with thesthat 2.3.1  using [sparkr-build-sandbox](https://github.com/zero323/sparkr-build-sandbox) on c7ed64af9e697b3619779857dd820832176b3be3

   R 3.4.4  (image digest ec9032f8cf98)
   ```
   docker pull zero323/sparkr-build-sandbox:3.4.4
   docker run zero323/sparkr-build-sandbox:3.4.4 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
    ```
    3.5.3 (image digest 0b1759ee4d1d)

    ```
    docker pull zero323/sparkr-build-sandbox:3.5.3
    docker run zero323/sparkr-build-sandbox:3.5.3 zero323 --branch SPARK-23435 --commit
    c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
    ```

   and 3.6.2 (image digest 6594c8ceb72f)
    ```
   docker pull zero323/sparkr-build-sandbox:3.6.2
   docker run zero323/sparkr-build-sandbox:3.6.2 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
   ````

   Corresponding [asciicast](https://asciinema.org/) are available as 10.5281/zenodo.3629431

     [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3629431.svg)](https://doi.org/10.5281/zenodo.3629431)

   (a bit to large to burden asciinema.org, but can run locally via `asciinema play`).

----------------------------

Continued from #27328

Closes #27359 from zero323/SPARK-23435.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-29 10:37:08 +09:00
Xiao Li d69ed9afdf Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit 1d20d13149.

Closes #27351 from gatorsmile/revertSPARK25496.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:34:12 -08:00
zero323 c3f17fa039 [SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls
### What changes were proposed in this pull request?

Disabling test for cleaning closure of recursive function.

### Why are the changes needed?

As of 9514b822a7 this test is no longer valid, and recursive calls, even simple ones:

```lead
  f <- function(x) {
    if(x > 0) {
      f(x - 1)
    } else {
      x
    }
  }
```

lead to

```
Error: node stack overflow
```

This is issue is silenced when tested with `testthat` 1.x (reason unknown), but cause failures when using `testthat` 2.x (issue can be reproduced outside test context).

Problem is known and tracked by [SPARK-30629](https://issues.apache.org/jira/browse/SPARK-30629)

Therefore, keeping this test active doesn't make sense, as it will lead to continuous test failures, when `testthat` is updated (https://github.com/apache/spark/pull/27359 / SPARK-23435).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

CC falaki

Closes #27363 from zero323/SPARK-29777-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:16:22 -08:00
zero323 40b1f4d87e [SPARK-30645][SPARKR][TESTS][WINDOWS] Move Unicode test data to external file
### What changes were proposed in this pull request?

Reference data for "collect() support Unicode characters" has been moved to an external file, to make test OS and locale independent.

### Why are the changes needed?

As-is, embedded data is not properly encoded on Windows:

```
library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version
#         "Windows"      "Server x64"     "build 17763"
#          nodename           machine             login
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator"
#              user    effective_user
#   "Administrator"   "Administrator"

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

df <- read.df(jsonPath, "json")

printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
```
This can be reproduced outside tests (Windows Server 2019, English locale), and causes failures, when `testthat` is updated to 2.x (https://github.com/apache/spark/pull/27359). Somehow problem is not picked-up when test is executed on `testthat` 1.0.2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Running modified test, manual testing.

### Note

Alternative seems to be to used bytes, but it hasn't been properly tested.

```
test_that("collect() support Unicode characters", {

  lines <- markUtf8(c(
    '{"name": "안녕하세요"}',
    '{"name": "您好", "age": 30}',
    '{"name": "こんにちは", "age": 19}',
    '{"name": "Xin ch\xc3\xa0o"}'
  ))

  jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
  writeLines(lines, jsonPath, useBytes = TRUE)

  expected <- regmatches(lines, regexec('(?<="name": ").*?(?=")', lines, perl = TRUE))

  df <- read.df(jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))

  rdf$name <- markUtf8(rdf$name)
  expect_equal(rdf$name[1], expected[[1]])
  expect_equal(rdf$name[2], expected[[2]])
  expect_equal(rdf$name[3], expected[[3]])
  expect_equal(rdf$name[4], expected[[4]])

  df1 <- createDataFrame(rdf)
  expect_equal(
    collect(
      where(df1, df1$name == expected[[2]])
    )$name,
    expected[[2]]
  )
})
```

Closes #27362 from zero323/SPARK-30645.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-26 12:59:53 +09:00
zhengruifeng f35f352096 [SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.

### Why are the changes needed?
Current RF with numTrees=1 will directly build a tree using the orignial dataset,

while with numTrees>1 it will use bootstrap samples to build trees.

This design is for training a DecisionTreeModel by the impl of RandomForest, however, it is somewhat strange.

In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.

### Does this PR introduce any user-facing change?
Yes, new param is added

### How was this patch tested?
existing testsuites

Closes #27254 from zhengruifeng/add_bootstrap.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-23 16:44:13 +08:00
zero323 2330a5682d [SPARK-30607][SQL][PYSPARK][SPARKR] Add overlay wrappers for SparkR and PySpark
### What changes were proposed in this pull request?

This PR adds:

- `pyspark.sql.functions.overlay` function to PySpark
- `overlay` function to SparkR

### Why are the changes needed?

Feature parity. At the moment R and Python users can access this function only using SQL or `expr` / `selectExpr`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes #27325 from zero323/SPARK-30607.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-23 16:16:47 +09:00
jiake b389b8c5f0 [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE
### What changes were proposed in this pull request?
Fix all the failed tests when enable AQE.

### Why are the changes needed?
Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests

Closes #26813 from JkSelf/enableAQEDefault.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-13 22:55:19 +08:00
HyukjinKwon 866b7df348 [SPARK-30335][SQL][DOCS] Add a note first, last, collect_list and collect_set can be non-deterministic in SQL function docs as well
### What changes were proposed in this pull request?
This PR adds a note first and last can be non-deterministic in SQL function docs as well.
This is already documented in `functions.scala`.

### Why are the changes needed?
Some people look reading SQL docs only.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Jenkins will test.

Closes #27099 from HyukjinKwon/SPARK-30335.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-07 14:31:59 +09:00
Yuming Wang 696288f623 [INFRA] Reverts commit 56dcd79 and c216ef1
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79

2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1

### Why are the changes needed?
Shouldn't change master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master

Closes #26915 from wangyum/revert-master.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 19:57:44 -07:00
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Boris Boutkov 3bf5498b4a [MINOR][DOCS] Fix documentation for slide function
### What changes were proposed in this pull request?

This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info.

### Why are the changes needed?

Documentation improvement

### Does this PR introduce any user-facing change?

No (doc-only change).

### How was this patch tested?

Manually tested by documentation build.

Closes #26896 from bboutkov/pyspark_doc_fix.

Authored-by: Boris Boutkov <boris.boutkov@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-16 16:29:09 +09:00
Hossein 9514b822a7 [SPARK-29777][SPARKR] SparkR::cleanClosure aggressively removes a function required by user function
### What changes were proposed in this pull request?
The implementation for walking through the user function AST and picking referenced variables and functions, had an optimization to skip a branch if it had already seen it. This runs into an interesting problem in the following example

```
df <- createDataFrame(data.frame(x=1))
f1 <- function(x) x + 1
f2 <- function(x) f1(x) + 2
dapplyCollect(df, function(x) { f1(x); f2(x) })
```
Results in error:
```
org.apache.spark.SparkException: R computation failed with
 Error in f1(x) : could not find function "f1"
Calls: compute -> computeFunc -> f2
```

### Why are the changes needed?
Bug fix

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit tests in `test_utils.R`

Closes #26429 from falaki/SPARK-29777.

Authored-by: Hossein <hossein@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:04:59 +09:00
Dongjoon Hyun 42f8f79ff0 [SPARK-29936][R] Fix SparkR lint errors and add lint-r GitHub Action
### What changes were proposed in this pull request?

This PR fixes SparkR lint errors and adds `lint-r` GitHub Action to protect the branch.

### Why are the changes needed?

It turns out that we currently don't run it. It's recovered yesterday. However, after that, our Jenkins linter jobs (`master`/`branch-2.4`) has been broken on `lint-r` tasks.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action on this PR in addition to Jenkins R and AppVeyor R.

Closes #26564 from dongjoon-hyun/SPARK-29936.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-17 21:01:01 -08:00
Dongjoon Hyun cc12cf6029 [SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API
### What changes were proposed in this pull request?

[[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](https://github.com/apache/spark/pull/26133) upgrades to Arrow 0.15 at Scala/Java/Python. This PR aims to upgrade `SparkR` to use Arrow 0.15 API. Currently, it's broken.

### Why are the changes needed?

First of all, it turns out that our Jenkins jobs (including PR builder) ignores Arrow test. Arrow 0.15 has a breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only one having SparkR Arrow tests but it's broken now.

**Jenkins**
```
Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25)
- arrow not installed
```

Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14.
```
Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
```

It is due to the version mismatch.
```java
int messageLength = MessageSerializer.bytesToInt(buffer.array());
if (messageLength == IPC_CONTINUATION_TOKEN) {
  buffer.clear();
  // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
  if (in.readFully(buffer) == 4) {
    messageLength = MessageSerializer.bytesToInt(buffer.array());
  }
}

// Length of 0 indicates end of stream
if (messageLength != 0) {
  // Read the message into the buffer.
  ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);
```
 After upgrading this to 0.15, we are hitting ARROW-5505. This PR upgrades Arrow version in AppVeyor and fix the issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the AppVeyor.

This PR passed here.
- https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044

```
SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
................
```

Closes #26555 from dongjoon-hyun/SPARK-R-TEST.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-16 18:28:27 -08:00
Dongjoon Hyun f77c10de38 [SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+
### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](https://github.com/apache/arrow/pull/5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After https://github.com/apache/spark/pull/26133, JDK11 Jenkins job seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```scala
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available&#010;
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)&#010;
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)&#010;
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)&#010;
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)&#010;
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)&#010;
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 23:58:15 -08:00
HyukjinKwon 17321782de [SPARK-26923][R][SQL][FOLLOW-UP] Show stderr in the exception whenever possible in RRunner
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23977 I made a mistake related to this line: 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112)

Previously,

1. the reader iterator for R worker read some initial data eagerly during RDD materialization. So it read the data before actual execution. For some reasons, in this case, it showed standard error from R worker.

2. After that, when error happens during actual execution, stderr wasn't shown: 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL260)

After my change 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112), it now ignores 1. case and only does 2. of previous code path, because 1. does not happen anymore as I avoided to such eager execution (which is consistent with PySpark code path).

This PR proposes to do only 1.  before/after execution always because It is pretty much possible R worker was failed during actual execution and it's best to show the stderr from R worker whenever possible.

### Why are the changes needed?

It currently swallows standard error from R worker which makes debugging harder.

### Does this PR introduce any user-facing change?

Yes,

```R
df <- createDataFrame(list(list(n=1)))
collect(dapply(df, function(x) {
  stop("asdkjasdjkbadskjbsdajbk")
  x
}, structType("a double")))
```

**Before:**

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, 192.168.35.193, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (cranshed)
	at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:130)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:118)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
	at org.apache.spark.
```

**After:**

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.35.193, executor driver): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in computeFunc(inputData) : asdkjasdjkbadskjbsdajbk

	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:128)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegen
```

### How was this patch tested?

Manually tested and unittest was added.

Closes #26517 from HyukjinKwon/SPARK-26923-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-15 11:13:36 +09:00
Liang-Chi Hsieh e7263242bd Revert "[SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh"
### What changes were proposed in this pull request?

This reverts commit 91d990162f.

### Why are the changes needed?

CRAN check is pretty important for R package, we should enable it.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #26381 from viirya/revert-SPARK-24152.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 15:14:58 -08:00
Dongjoon Hyun 91d990162f [SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh
### What changes were proposed in this pull request?

This PR aims to remove `check-cran` from `run-tests.sh`.
We had better add an independent Jenkins job to run `check-cran`.

### Why are the changes needed?

CRAN instability has been a blocker for our daily dev process.
The following simple check causes consecutive failures in 4 of 9 Jenkins
jobs + PR builder.

```
* checking CRAN incoming feasibility ...Error in
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```

- spark-branch-2.4-test-sbt-hadoop-2.6
- spark-branch-2.4-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-3.2
- PRBuilder

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Currently, PR builder is failing due to the above issue. This PR should pass the Jenkins.

Closes #26375 from dongjoon-hyun/SPARK-24152.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-02 21:37:40 -07:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
HyukjinKwon 0f48aafab8 [SPARK-29339][R] Support Arrow 0.14 in vectoried dapply and gapply (test it in AppVeyor build)
### What changes were proposed in this pull request?

This PR proposes:

1. Use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

### Why are the changes needed?

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

### Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

### How was this patch tested?

AppVeyor

Closes #25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 08:56:45 +09:00
sheepstop 81de9d3c29 [SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component

### Why are the changes needed?
It will throw exception if the value stare is 0, but array indices start at 0 most of times in other scenarios.

### Does this PR introduce any user-facing change?
Yes, more info provided to user.

### How was this patch tested?
No tests added, only doc change.

Closes #25704 from sheepstop/master.

Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-24 18:57:54 +09:00
Liang-Chi Hsieh 12e1583093 [SPARK-28927][ML] Rethrow block mismatch exception in ALS when input data is nondeterministic
### What changes were proposed in this pull request?

Fitting ALS model can be failed due to nondeterministic input data. Currently the failure is thrown by an ArrayIndexOutOfBoundsException which is not explainable for end users what is wrong in fitting.

This patch catches this exception and rethrows a more explainable one, when the input data is nondeterministic.

Because we may not exactly know the output deterministic level of RDDs produced by user code, this patch also adds a note to Scala/Python/R ALS document about the training data deterministic level.

### Why are the changes needed?

ArrayIndexOutOfBoundsException was observed during fitting ALS model. It was caused by mismatching between in/out user/item blocks during computing ratings.

If the training RDD output is nondeterministic, when fetch failure is happened, rerun part of training RDD can produce inconsistent user/item blocks.

This patch is needed to notify users ALS fitting on nondeterministic input.

### Does this PR introduce any user-facing change?

Yes. When fitting ALS model on nondeterministic input data, previously if rerun happens, users would see ArrayIndexOutOfBoundsException caused by mismatch between In/Out user/item blocks.

After this patch, a SparkException with more clear message will be thrown, and original ArrayIndexOutOfBoundsException is wrapped.

### How was this patch tested?

Tested on development cluster.

Closes #25789 from viirya/als-indeterminate-input.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-18 09:22:13 -05:00
dengziming 8f632d7045 [MINOR][DOCS] Fix few typos in the java docs
JIRA :https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' change into  'an hdfs'
'an unique' change into 'a unique'
'an url' change into 'a url'
'a error' change into 'an error'

Closes #25756 from dengziming/feature_fix_typos.

Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:30:03 +09:00
Sean Owen 6378d4bc06 [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc

Notes:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes #25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-09 10:19:40 -05:00
Sean Owen 36559b6525 [SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type
### What changes were proposed in this pull request?

`DataFrameReader.json()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too.

### Why are the changes needed?

scaladoc is incorrect.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #25687 from srowen/SPARK-28977.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-05 18:32:45 +09:00
HyukjinKwon 5cf2602ccb [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows
### What changes were proposed in this pull request?

This PR adds three more information:

- Mentions that `bash` in `PATH` to build is required.
- Specifies supported JDK and Maven versions
- Explicitly mentions that building on Windows is not the official support

### Why are the changes needed?

In order to make SparkR developers on Windows able to work, and describe what is needed for AppVeyor build.

### Does this PR introduce any user-facing change?

No. It just adds some information in `R/WINDOWS.md`

### How was this patch tested?

This is already being tested as so in AppVeyor. Also, I tested as so (long ago though).

Closes #25647 from HyukjinKwon/SPARK-28946.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-03 15:08:18 +09:00
WeichenXu 7f605f5559 [SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true
### What changes were proposed in this pull request?

Make `spark.sql.crossJoin.enabled` default value true

### Why are the changes needed?

For implicit cross join, we can set up a watchdog to cancel it if running for a long time.
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.

So that in order to address this in simpler way, we can turn off showing this cross-join error by default.

For reference, I list some cases raising mismatching error here:
Providing:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #25520 from WeichenXu123/SPARK-28621.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:53:37 +08:00
HyukjinKwon 2fd83c2820 [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->

This PR proposes to set minimum and maximum Java version specification. (see https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages).

Seems there is not the standard way to specify both given the documentation and other packages (see https://gist.github.com/glin/bd36cf1eb0c7f8b1f511e70e2fb20f8d).

I found two ways from existing packages on CRAN.

```
Package (<= 1 & > 2)
Package (<= 1, > 2)
```

The latter seems closer to other standard notations such as `R (>= 2.14.0), R (>= r56550)`. So I have chosen the latter way.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->

Seems the package might be rejected by CRAN. See https://github.com/apache/spark/pull/25472#issuecomment-522405742

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->

No.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->

JDK 8

```bash
./build/mvn -DskipTests -Psparkr clean package
./R/run-tests.sh

...
basic tests for CRAN: .............
...
```

JDK 11

```bash
./build/mvn -DskipTests -Psparkr -Phadoop-3.2 clean package
./R/run-tests.sh

...
basic tests for CRAN: .............
...
```

Closes #25490 from HyukjinKwon/SPARK-28756.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-19 20:15:17 +09:00
Dongjoon Hyun 5756a47a9f [SPARK-28766][R][DOC] Fix CRAN incoming feasibility warning on invalid URL
### What changes were proposed in this pull request?

This updates an URL in R doc to fix `Had CRAN check errors; see logs`.

### Why are the changes needed?
Currently, this invalid link causes a warning during CRAN incoming feasibility. We had better fix this before submitting `3.0.0/2.4.4/2.3.4`.

**BEFORE**
```
* checking CRAN incoming feasibility ... NOTE
Maintainer: ‘Shivaram Venkataraman <shivaramcs.berkeley.edu>’

Found the following (possibly) invalid URLs:
  URL: https://wiki.apache.org/hadoop/HCFS (moved to https://cwiki.apache.org/confluence/display/hadoop/HCFS)
    From: man/spark.addFile.Rd
    Status: 404
    Message: Not Found
```

**AFTER**
```
* checking CRAN incoming feasibility ... Note_to_CRAN_maintainers
Maintainer: ‘Shivaram Venkataraman <shivaramcs.berkeley.edu>’
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Check the warning message during R testing.
```
$ R/install-dev.sh
$ R/run-tests.sh
```

Closes #25483 from dongjoon-hyun/SPARK-28766.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-17 11:11:36 -07:00
HyukjinKwon 7f44a6e367 [SPARK-28755][R][TESTS] Increase tolerance in 'spark.mlp' SparkR test for JDK 11
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->

This PR proposes to increase the tolerance for the exact value comparison in `spark.mlp` test. I don't know the root cause but some tolerance is already expected. I suspect it is not a big deal considering all other tests pass.

The values are fairly close:

JDK 8:

```
-24.28415, 107.8701, 16.86376, 1.103736, 9.244488
```

JDK 11:

```
-24.33892, 108.0316, 16.89082, 1.090723, 9.260533
```

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->

To fully support JDK 11. See, for instance, #25443 and #25423 for ongoing efforts.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->

No

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->

Manually tested on the top of https://github.com/apache/spark/pull/25472 with JDK 11

```bash
./build/mvn -DskipTests -Psparkr -Phadoop-3.2 package
./bin/sparkR
```

```R
absoluteSparkPath <- function(x) {
  sparkHome <- sparkR.conf("spark.home")
  file.path(sparkHome, x)
}
df <- read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
              source = "libsvm")
model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3),
                   solver = "l-bfgs", maxIter = 100, tol = 0.00001, stepSize = 1, seed = 1)
summary <- summary(model)
head(summary$weights, 5)
```

Closes #25478 from HyukjinKwon/SPARK-28755.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-16 10:03:14 -07:00
Dongjoon Hyun 2f04152921 [SPARK-28756][R] Fix checkJavaVersion to accept JDK8+
### What changes were proposed in this pull request?
Currently, `checkJavaVersion` only accepts JDK8 because it compares with the number in `SystemRequirements`. This PR changes it to accept the higher version, too.

### Why are the changes needed?
Without this, two test suites are skipped on JDK11 environment due to this check.

**BEFORE**
```
$ build/mvn -Phadoop-3.2 -Psparkr -DskipTests package
$ R/install-dev.sh
$ R/run-tests.sh
...
basic tests for CRAN: SS

Skipped ------------------------------------------------------------------------
1. create DataFrame from list or data.frame (test_basic.R#21) - error on Java check
2. spark.glm and predict (test_basic.R#57) - error on Java check
DONE ===========================================================================
```

**AFTER**
```
basic tests for CRAN: .............

DONE ===========================================================================
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Manually, build and test on JDK11.

Closes #25472 from dongjoon-hyun/SPARK-28756.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-16 10:01:59 -07:00
Maxim Gekk a5a5da78cf [SPARK-28471][SQL] Replace yyyy by uuuu in date-timestamp patterns without era
## What changes were proposed in this pull request?

In the PR, I propose to use `uuuu` for years instead of `yyyy` in date/timestamp patterns without the era pattern `G` (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). **Parsing/formatting of positive years (current era) will be the same.** The difference is in formatting negative years belong to previous era - BC (Before Christ).

I replaced the `yyyy` pattern by `uuuu` everywhere except:
1. Test, Suite & Benchmark. Existing tests must work as is.
2. `SimpleDateFormat` because it doesn't support the `uuuu` pattern.
3. Comments and examples (except comments related to already replaced patterns).

Before the changes, the year of common era `100` and the year of BC era `-99`, showed similarly as `100`.  After the changes negative years will be formatted with the `-` sign.

Before:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+----------+
|     value|
+----------+
|0100-01-01|
+----------+
```

After:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+-----------+
|      value|
+-----------+
|-0099-01-01|
+-----------+
```

## How was this patch tested?

By existing test suites, and added tests for negative years to `DateFormatterSuite` and `TimestampFormatterSuite`.

Closes #25230 from MaxGekk/year-pattern-uuuu.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-28 20:36:36 -07:00
Douglas R Colkitt 8fc5cb6285 [SPARK-28473][DOC] Stylistic consistency of build command in README
## What changes were proposed in this pull request?

Change the format of the build command in the README to start with a `./` prefix

    ./build/mvn -DskipTests clean package

This increases stylistic consistency across the README- all the other commands have a `./` prefix. Having a visible `./` prefix also makes it clear to the user that the shell command requires the current working directory to be at the repository root.

## How was this patch tested?

README.md was reviewed both in raw markdown and in the Github rendered landing page for stylistic consistency.

Closes #25231 from Mister-Meeseeks/master.

Lead-authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com>
Co-authored-by: Mister-Meeseeks <douglas.colkitt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-23 16:29:46 -07:00
Liang-Chi Hsieh 7083ec051e [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API
## What changes were proposed in this pull request?

New R api of Arrow has removed `as_tibble` as of 2ef96c8623. Arrow optimization for DataFrame in R doesn't work due to the change.

This can be tested as below, after installing latest Arrow:

```
./bin/sparkR --conf spark.sql.execution.arrow.sparkr.enabled=true
```

```
> collect(createDataFrame(mtcars))
```

Before this PR:
```
> collect(createDataFrame(mtcars))
 Error in get("as_tibble", envir = asNamespace("arrow")) :
   object 'as_tibble' not found
```

After:
```
> collect(createDataFrame(mtcars))
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
...
```

## How was this patch tested?

Manual test.

Closes #25012 from viirya/SPARK-28215.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-01 13:21:06 +09:00
ozan a38d605d0d [SPARK-18570][ML][R] RFormula support * and ^ operators
## What changes were proposed in this pull request?

Added support for `*` and `^` operators, along with expressions within parentheses. New operators just expand to already supported terms, such as;

 - y ~ a * b = y ~ a + b + a : b
 - y ~ (a+b+c)^3 = y ~ a + b + c + a : b + a : c + a :b : c

## How was this patch tested?

Added new unit tests to RFormulaParserSuite

mengxr yanboliang

Closes #24764 from ozancicek/rformula.

Authored-by: ozan <ozancancicekci@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-06-04 08:59:30 -05:00
HyukjinKwon db48da87f0 [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
## What changes were proposed in this pull request?

`spark.sql.execution.arrow.enabled` was added when we add PySpark arrow optimization.
Later, in the current master, SparkR arrow optimization was added and it's controlled by the same configuration `spark.sql.execution.arrow.enabled`.

There look two issues about this:

1. `spark.sql.execution.arrow.enabled` in PySpark was added from 2.3.0 whereas SparkR optimization was added 3.0.0. The stability is different so it's problematic when we change the default value for one of both optimization first.

2. Suppose users want to share some JVM by PySpark and SparkR. They are currently forced to use the optimization for all or none if the configuration is set globally.

This PR proposes two separate configuration groups for PySpark and SparkR about Arrow optimization:

- Deprecate `spark.sql.execution.arrow.enabled`
- Add `spark.sql.execution.arrow.pyspark.enabled` (fallback to `spark.sql.execution.arrow.enabled`)
- Add `spark.sql.execution.arrow.sparkr.enabled`
- Deprecate `spark.sql.execution.arrow.fallback.enabled`
- Add `spark.sql.execution.arrow.pyspark.fallback.enabled ` (fallback to `spark.sql.execution.arrow.fallback.enabled`)

Note that `spark.sql.execution.arrow.maxRecordsPerBatch` is used within JVM side for both.
Note that `spark.sql.execution.arrow.fallback.enabled` was added due to behaviour change. We don't need it in SparkR - SparkR side has the automatic fallback.

## How was this patch tested?

Manually tested and some unittests were added.

Closes #24700 from HyukjinKwon/separate-sparkr-arrow.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-06-03 10:01:37 +09:00
Sean Owen eed6de1a65 [MINOR][DOCS] Tighten up some key links to the project and download pages to use HTTPS
## What changes were proposed in this pull request?

Tighten up some key links to the project and download pages to use HTTPS

## How was this patch tested?

N/A

Closes #24665 from srowen/HTTPSURLs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 10:56:42 -07:00
Jash Gala 90085a1847 [SPARK-23619][DOCS] Add output description for some generator expressions / functions
## What changes were proposed in this pull request?

This PR addresses SPARK-23619: https://issues.apache.org/jira/browse/SPARK-23619

It adds additional comments indicating the default column names for the `explode` and `posexplode`
functions in Spark-SQL.

Functions for which comments have been updated so far:
* stack
* inline
* explode
* posexplode
* explode_outer
* posexplode_outer

## How was this patch tested?

This is just a change in the comments. The package builds and tests successfullly after the change.

Closes #23748 from jashgala/SPARK-23619.

Authored-by: Jash Gala <jashgala@amazon.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-04-27 10:30:12 +09:00
Maxim Gekk 1d20d13149 [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`.

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 10:55:56 +08:00
Hyukjin Kwon d7dd59a6b4 [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral

## How was this patch tested?

Manually built the doc and verified.

Closes #24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-03 08:30:24 +09:00
Dongjoon Hyun d575a453db Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit c5e83ab92c.
2019-04-02 01:05:54 -07:00
Maxim Gekk c5e83ab92c [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`.

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-02 10:20:06 +08:00
Felix Cheung fa0f791d4d [MINOR][R] fix R project description
## What changes were proposed in this pull request?

update as per this NOTE when running CRAN check

```
The Title field should be in title case, current version then in title case:
‘R Front end for 'Apache Spark'’
‘R Front End for 'Apache Spark'’

```

Closes #24255 from felixcheung/rdesc.

Authored-by: Felix Cheung <felixcheung_m@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 21:00:46 -07:00
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Sean Owen 8bc304f97e [SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0
## What changes were proposed in this pull request?

Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below.

## How was this patch tested?

Existing tests.

Closes #23098 from srowen/SPARK-26132.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-25 10:46:42 -05:00
Hyukjin Kwon 84ec06d95e Revert "[SPARK-27262][R] Add explicit UTF-8 Encoding to DESCRIPTION"
This reverts commit 240c6a4d75.
2019-03-25 11:02:14 +09:00
Michael Chirico 240c6a4d75 [SPARK-27262][R] Add explicit UTF-8 Encoding to DESCRIPTION
## What changes were proposed in this pull request?

I got this warning when following the recommended approach to generating documentation:

```
Warning message:
roxygen2 requires Encoding: UTF-8
```

As can be seen in [other](https://github.com/tidyverse/tidyverse/blob/master/DESCRIPTION) [`tidyverse`](https://github.com/tidyverse/dplyr/blob/master/DESCRIPTION) [`DESCRIPTION`s](https://github.com/tidyverse/readr/blob/master/DESCRIPTION), this is standard practice

This PR adds `Encoding: UTF-8` to `R/pkg/DESCRIPTION`

## How was this patch tested?

Pass the Jenkins without warning.

Closes #23823 from MichaelChirico/patch-1.

Authored-by: Michael Chirico <michaelchirico4@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-23 15:24:54 -07:00
Maxim Gekk 027ed2d11b [SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter
## What changes were proposed in this pull request?

The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs.

## How was this patch tested?

By running the existing tests - XORShiftRandomSuite

Closes #20793 from MaxGekk/hash-buff-size.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-23 11:26:09 -05:00
Huon Wilson b67d369572 [SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long
## What changes were proposed in this pull request?

This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns.

This is designed to exactly mimic the 32-bit `hash`, which uses
MurmurHash3. The name is designed to be more future-proof than the
'hash', by indicating the exact algorithm used, similar to md5 and the
sha hashes.

## How was this patch tested?

The tests for the existing `hash` function were duplicated to run with `xxhash64`.

Closes #24019 from huonw/hash64.

Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-20 16:34:34 +08:00
Dilip Biswal 7a136f8670 [SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r
## What changes were proposed in this pull request?
This is a minor follow-up PR for SPARK-27096. The original PR reconciled the join types supported between dataset and sql interface. In case of R, we do the join type validation in the R side. In this PR we do the correct validation and adds tests in R to test all the join types along with the error condition. Along with this, i made the necessary doc correction.

## How was this patch tested?
Add R tests.

Closes #24087 from dilipbiswal/joinfix_followup.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-16 13:04:54 +09:00
Jagadesh Kiran d9978fb4e4 [SPARK-26860][PYSPARK][SPARKR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
The docs describing RangeBetween & RowsBetween for pySpark & SparkR are not in sync with Spark description.

a. Edited PySpark and SparkR docs  and made description same for both RangeBetween and RowsBetween
b. created executable examples in both pySpark and SparkR documentation
c. Locally tested the patch for scala Style checks and UT for checking no testcase failures

Closes #23946 from jagadesh-kiran/master.

Authored-by: Jagadesh Kiran <jagadesh.n@in.verizon.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 08:53:09 -05:00
Hyukjin Kwon 93ea353cae [SPARK-26920][R] Deduplicate type checking across Arrow optimization in SparkR
## What changes were proposed in this pull request?

This PR proposes two things.

1.. Deduplicates the type checking logic. While I am here, I checked each type. Currently, binary type, float type, nested struct type and array type are not supported.

**For map and nested struct types:**

 it's expected to be unsupported per Spark's arrow optimization.

```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: map<string,double>
...
```
```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type:  struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
...
```

Please track the trace below to double check.

```
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowType(ArrowUtils.scala:56)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowField(ArrowUtils.scala:92)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.$anonfun$toArrowSchema$1(ArrowUtils.scala:116)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:237)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowSchema(ArrowUtils.scala:115)
	at org.apache.spark.sql.execution.arrow.ArrowBatchStreamWriter.<init>(ArrowConverters.scala:50)
	at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2(Dataset.scala:3215)
	at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2$adapted(Dataset.scala:3212)
```

**For float and binary types:**

They cause corrupt values in some cases. It needs to be investigated separately.

**For array type:**

```
Error in Table__to_dataframe(x, use_threads = use_threads) :
  cannot handle Array of type list
```

Seems to be Arrow's R library limitation. It needs to be investigated separately as well.

2.. While I am touching the type specification codes across Arrow optimization, I move the Arrow optimization related tests into a separate filed called `test_arrow.R`.

## How was this patch tested?

Tests were added and also manually tested.

Closes #23969 from HyukjinKwon/SPARK-26920.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:23:24 +09:00
Hyukjin Kwon 28d003097b [SPARK-27102][R][PYTHON][CORE] Remove the references to Python's Scala codes in R's Scala codes
## What changes were proposed in this pull request?

Currently, R's Scala codes happened to refer Python's Scala codes for code deduplications. It's a bit odd. For instance, when we face an exception from R, it shows python related code path, which makes confusing to debug. It should rather have one code base and R's and Python's should share.

This PR proposes:

1. Make a `SocketAuthServer` and move `PythonServer` so that `PythonRDD` and `RRDD` can share it.
2. Move `readRDDFromFile` and `readRDDFromInputStream` into `JavaRDD`.
3. Reuse `RAuthHelper` and remove `RSocketAuthHelper` in `RRDD`.
4. Rename `getEncryptionEnabled` to `isEncryptionEnabled` while I am here.

So, now, the places below:

- `sql/core/src/main/scala/org/apache/spark/sql/api/r`
- `core/src/main/scala/org/apache/spark/api/r`
- `mllib/src/main/scala/org/apache/spark/ml/r`

don't refer Python's Scala codes.

## How was this patch tested?

Existing tests should cover this.

Closes #24023 from HyukjinKwon/SPARK-27102.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-10 15:08:23 +09:00
Hyukjin Kwon 543cd572c6 [SPARK-26922][R] Set socket timeout consistently in Arrow optimization
## What changes were proposed in this pull request?

This PR sets socket timeout consistently across Arrow optimization by `SPARKR_BACKEND_CONNECTION_TIMEOUT`. There looks only one place left.

## How was this patch tested?

Existing tests should cover.

Closes #23971 from HyukjinKwon/SPARK-26922.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-06 08:58:49 +09:00
Sean Owen 0deebd3820 [SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8
## What changes were proposed in this pull request?

Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 as they use Hadoop's implementation underneath. I think these are all the places that this needs a mention in the user-facing docs.

## How was this patch tested?

Doc tests.

Closes #23962 from srowen/SPARK-26016.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-05 08:03:39 +09:00
Hyukjin Kwon 88bc481b9e [SPARK-26830][SQL][R] Vectorized R dapply() implementation
## What changes were proposed in this pull request?

This PR targets to add vectorized `dapply()` in R, Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
rdf <- read.csv("500000.csv")
df <- cache(createDataFrame(rdf))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  count(cache(dapply(df, function(rdf) { rdf }, schema(df))))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 13.42037 mins
```

```
Time difference of 30.64156 secs
```

The performance improvement was around **2627%**.

### Limitations

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Unit tests were added, and manually tested.

Closes #23787 from HyukjinKwon/SPARK-26830-1.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-27 14:29:58 +09:00
Felix Cheung 927081dd95 [R] update package description
## What changes were proposed in this pull request?

update package description

Closes #23852 from felixcheung/rdesccran.

Authored-by: Felix Cheung <felixcheung_m@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-21 19:00:36 +08:00
Hyukjin Kwon 3c15d8b71c [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame
## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from Spark DataFrame to R DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 221.32014 secs
```

```
Time difference of 15.51145 secs
```

The performance improvement was around **1426%**.

### Limitations:

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values. In this case, we decide to fall back to non-optimization code path.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Existing tests related with Arrow optimization cover this change. Also, manually tested.

Closes #23760 from HyukjinKwon/SPARK-26762.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-20 11:35:17 +08:00
Hyukjin Kwon ab850c02f7 [SPARK-26901][SQL][R] Adds child's output into references to avoid column-pruning for vectorized gapply()
## What changes were proposed in this pull request?

Currently, looks column pruning is done to vectorized `gapply()`. Given R native function could use all referred fields so it shouldn't be pruned. To avoid this, it adds child's output into `references` like `OutputConsumer`.

```
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
explain(count(groupBy(gapply(df,
                             "gear",
                             function(key, group) {
                               data.frame(gear = key[[1]], disp = mean(group$disp))
                             },
                             structType("gear double, disp double")))), TRUE)
```

**Before:**

```
== Optimized Logical Plan ==
Aggregate [count(1) AS count#41L]
+- Project
   +- FlatMapGroupsInRWithArrow [...]
      +- Project [gear#9]
         +- LogicalRDD [mpg#0, cyl#1, disp#2, hp#3, drat#4, wt#5, qsec#6, vs#7, am#8, gear#9, carb#10], false

== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)], output=[count#41L])
+- Exchange SinglePartition
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#44L])
      +- *(3) Project
         +- FlatMapGroupsInRWithArrow [...]
            +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(gear#9, 200)
                  +- *(1) Project [gear#9]
                     +- *(1) Scan ExistingRDD arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
```

**After:**

```
== Optimized Logical Plan ==
Aggregate [count(1) AS count#91L]
+- Project
   +- FlatMapGroupsInRWithArrow [...]
      +- LogicalRDD [mpg#0, cyl#1, disp#2, hp#3, drat#4, wt#5, qsec#6, vs#7, am#8, gear#9, carb#10], false

== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)], output=[count#91L])
+- Exchange SinglePartition
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#94L])
      +- *(3) Project
         +- FlatMapGroupsInRWithArrow [...]
            +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(gear#9, 200)
                  +- *(1) Scan ExistingRDD arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
```

Currently, it adds corrupt values for missing columns (via pruned columnar batches to Arrow writers that requires non-pruned columns) such as:

```r
...
  c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 4.17777978645388e-314)
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.04669129845114e+219)
  c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313,
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.47032822920623e-323)
...
```

which should be something like:

```r
...
  c(4, 4, 1, 2, 2, 4, 4, 1, 2, 1, 1, 2)
  c(26, 30.4, 15.8, 19.7, 15)
  c(4, 4, 8, 6, 8)
  c(120.3, 95.1, 351, 145, 301)
...
```

## How was this patch tested?

Manually tested, and unit tests were added.

The test code is basiaclly:

```r
df <- createDataFrame(mtcars)
count(gapply(df,
             c("gear"),
             function(key, group) {
                stopifnot(all(group$hp > 50))
                group
             },
             schema(df)))
```

`mtcars`'s hp is all more then 50.

```r
> mtcars$hp > 50
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE
```

However, due to corrpt value, (like 0 or 7.xxxxx), werid values were found. So, it's currently being failed as below in the master:

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 82 in stage 1.0 failed 1 times, most recent failure: Lost task 82.0 in stage 1.0 (TID 198, localhost, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (crashed)
 Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
```

I also compared the total length while I am here. Regular `gapply` without Arrow has some holes .. so I had to compare the results with R data frame.

Closes #23810 from HyukjinKwon/SPARK-26901.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-20 10:24:40 +08:00
Hyukjin Kwon 8126d09fb5 [SPARK-26761][SQL][R] Vectorized R gapply() implementation
## What changes were proposed in this pull request?

This PR targets to add vectorized `gapply()` in R, Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
collect(gapply(df,
               "gear",
               function(key, group) {
                 data.frame(gear = key[[1]], disp = mean(group$disp) > group$disp)
               },
               structType("gear double, disp boolean")))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

**R code**

```r
rdf <- read.csv("500000.csv")
rdf <- rdf[, c("Month.of.Joining", "Weight.in.Kgs.")]  # We're only interested in the key and values to calculate.
df <- cache(createDataFrame(rdf))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  count(gapply(df,
               "Month_of_Joining",
               function(key, group) {
                 data.frame(Month_of_Joining = key[[1]], Weight_in_Kgs_ = mean(group$Weight_in_Kgs_) > group$Weight_in_Kgs_)
               },
               structType("Month_of_Joining integer, Weight_in_Kgs_ boolean")))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 35.67459 secs
```

```
Time difference of 4.301399 secs
```

The performance improvement was around **829%**.

**Note that** I am 100% sure this PR improves more then 829% because I gave up testing it with non-Arrow optimization because it took super super super long when the data size becomes bigger.

### Limitations

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Unit tests were added

**TODOs:**
- [x] Draft codes
- [x] make the tests passed
- [x] make the CRAN check pass
- [x] Performance measurement
- [x] Supportability investigation (for instance types)

Closes #23746 from HyukjinKwon/SPARK-26759.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-13 11:19:58 +08:00
Huaxin Gao bc6f191451 [SPARK-24779][R] Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off
## What changes were proposed in this pull request?

Add the R version of map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

## How was this patch tested?

Add test in test_sparkSQL.R

Closes #21835 from huaxingao/spark-24779.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-31 19:38:32 +08:00
Liang-Chi Hsieh 33107897ad [SPARK-11215][ML] Add multiple columns support to StringIndexer
## What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.

## How was this patch tested?

Added tests.

Closes #20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-29 09:21:25 -06:00
hyukjinkwon e8982ca7ad [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame
## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from R DataFrame to Spark DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

**R code**

```r
createDataFrame(mtcars) # Initializes
rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 29.9468 secs
```

```
Time difference of 3.222129 secs
```

The performance improvement was around **950%**.
Actually, this PR improves around **1200%**+ because this PR includes a small optimization about regular R DataFrame -> Spark DatFrame. See https://github.com/apache/spark/pull/22954#discussion_r231847272

### Limitations:

For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.
In this case, we decide to fall back to non-optimization code path.

## How was this patch tested?

Small test was added.

I manually forced to set this optimization `true` for _all_ R tests and they were _all_ passed (with few of fallback warnings).

**TODOs:**
- [x] Draft codes
- [x] make the tests passed
- [x] make the CRAN check pass
- [x] Performance measurement
- [x] Supportability investigation (for instance types)
- [x] Wait for Arrow 0.12.0 release
- [x] Fix and match it to Arrow 0.12.0

Closes #22954 from HyukjinKwon/r-arrow-createdataframe.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-27 10:45:49 +08:00
Maxim Gekk 33b5039cd3 [SPARK-25935][SQL] Allow null rows for bad records from JSON/CSV parsers
## What changes were proposed in this pull request?

This PR reverts  #22938 per discussion in #23325

Closes #23325

Closes #23543 from MaxGekk/return-nulls-from-json-parser.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-15 13:02:55 +08:00
Hyukjin Kwon d371180c01 [MINOR][R] Deduplicate RStudio setup documentation
## What changes were proposed in this pull request?

This PR targets to deduplicate RStudio setup for SparkR.

## How was this patch tested?

N/A

Closes #23421 from HyukjinKwon/minor-doc.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-02 08:04:36 +08:00
Hyukjin Kwon 39a0493387 [SPARK-26227][R] from_[csv|json] should accept schema_of_[csv|json] in R API
## What changes were proposed in this pull request?

**1. Document `from_csv(..., schema_of_csv(...))` support:**

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, schema_of_csv(csv))))
```

```
    from_csv(csv)
1 Amsterdam, 2018
```

**2. Allow `from_json(..., schema_of_json(...))`**

Before:

```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_json’ for signature ‘"Column", "Column"’
```

After:

```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
  from_json(people_json)
1                    Bob
```

**3. (While I'm here) Allow `structType` as schema for `from_csv` support to match with `from_json`.**

Before:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_csv’ for signature ‘"Column", "structType"’
```

After:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
    from_csv(csv)
1 Amsterdam, 2018
```

## How was this patch tested?

Manually tested and unittests were added.

Closes #23184 from HyukjinKwon/SPARK-26227-1.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-02 08:01:34 +08:00
Maxim Gekk 7c7fccfeb5 [SPARK-26424][SQL] Use java.time API in date/timestamp expressions
## What changes were proposed in this pull request?

In the PR, I propose to switch the `DateFormatClass`, `ToUnixTimestamp`, `FromUnixTime`, `UnixTime` on java.time API for parsing/formatting dates and timestamps. The API has been already implemented by the `Timestamp`/`DateFormatter` classes. One of benefit is those classes support parsing timestamps with microsecond precision. Old behaviour can be switched on via SQL config: `spark.sql.legacy.timeParser.enabled` (`false` by default).

## How was this patch tested?

It was tested by existing test suites - `DateFunctionsSuite`, `DateExpressionsSuite`, `JsonSuite`, `CsvSuite`, `SQLQueryTestSuite` as well as PySpark tests.

Closes #23358 from MaxGekk/new-time-cast.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-12-27 11:09:50 +08:00
Qi Shao 19b63c560d [MINOR][R] Fix indents of sparkR welcome message to be consistent with pyspark and spark-shell
## What changes were proposed in this pull request?

1. Removed empty space at the beginning of welcome message lines of sparkR to be consistent with welcome message of `pyspark` and `spark-shell`
2. Setting indent of logo message lines to 3 to be consistent with welcome message of `pyspark` and `spark-shell`

Output of `pyspark`:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.6 (default, Jun 28 2018 11:07:29)
SparkSession available as 'spark'.
```

Output of `spark-shell`:
```
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
```

## How was this patch tested?

Before:
Output of `sparkR`:
```
 Welcome to
    ____              __
   / __/__  ___ _____/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.4.0
    /_/

 SparkSession available as 'spark'.
```
After:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

SparkSession available as 'spark'.
```

Closes #23293 from AzureQ/master.

Authored-by: Qi Shao <qi.shao.nyu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2018-12-13 20:05:49 +08:00