[SPARK-36631][R] Ask users if they want to download and install SparkR in non Spark scripts

### What changes were proposed in this pull request?

This PR proposes to ask users whether they want to download and install the Spark distribution when they use SparkR installed from CRAN and no Spark installation can be found.

A `SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
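
For instance, an environment that cannot show an interactive prompt could opt out before creating the session; a minimal sketch (the variable is compared against the strings `TRUE`/`FALSE`):

```
# Skip the new prompt and restore the previous auto-install behavior.
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")
library(SparkR)
sparkR.session(master = "local")
```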

### Why are the changes needed?

This is required for CRAN. SparkR is currently removed from CRAN: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

Yes. When run from a plain R shell or `Rscript`, `sparkR.session(...)` now asks whether users want to download and install the Spark package.

### How was this patch tested?

**R shell**

Valid input (`n`):

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```
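
As the error suggests, the prompt never appears when SparkR can find an existing installation; a sketch with a hypothetical path:

```
# Hypothetical path; point it at an existing Spark distribution.
sparkR.session(master = "local", sparkHome = "/path/to/spark-3.3.0-bin-hadoop2.7")

# Or equivalently, via the environment variable checked by sparkCheckInstall:
Sys.setenv(SPARK_HOME = "/path/to/spark-3.3.0-bin-hadoop2.7")
sparkR.session(master = "local")
```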

Invalid input:

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```
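
Answering `y` ends in the same `install.spark()` call that users can also run ahead of time themselves; a sketch (the mirror is auto-selected as in the log above):

```
# Pre-install Spark into the cache so sparkR.session() finds it and never asks.
library(SparkR)
install.spark(hadoopVersion = "2.7", overwrite = FALSE)
```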

**Rscript**

```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```

```
Rscript tmp.R
```

Valid input (`n`):

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```

Invalid input:

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```

`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested): both set `EXISTING_SPARKR_BACKEND_PORT`, so the new safeguard skips the prompt.

Closes #33887 from HyukjinKwon/SPARK-36631.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The diff (2 changed files with 38 additions and 0 deletions); first, the prompt in `sparkCheckInstall`:

```
@@ -655,6 +655,40 @@ sparkCheckInstall <- function(sparkHome, master, deployMode) {
     } else {
       if (interactive() || isMasterLocal(master)) {
         message("Spark not found in SPARK_HOME: ", sparkHome)
+        # If EXISTING_SPARKR_BACKEND_PORT environment variable is set, assume
+        # that we're in Spark submit. spark-submit always sets Spark home
+        # so this case should not happen. This is just a safeguard.
+        isSparkRSubmit <- Sys.getenv("EXISTING_SPARKR_BACKEND_PORT", "") != ""
+        # SPARKR_ASK_INSTALLATION is an internal environment variable in case
+        # users want to disable this behavior. This environment variable should
+        # be removed if no user complains. This environment variable was added
+        # in case other notebook projects are affected.
+        if (!isSparkRSubmit && Sys.getenv("SPARKR_ASK_INSTALLATION", "TRUE") == "TRUE") {
+          # Finally, we're either plain R shell or Rscript.
+          msg <- paste0(
+            "Will you download and install (or reuse if it exists) Spark package ",
+            "under the cache [", sparkCachePath(), "]? (y/n): ")
+          answer <- NA
+          while (is.na(answer) || (answer != "y" && answer != "n")) {
+            # Dispatch on R shell in case readLines does not work in RStudio
+            # See https://stackoverflow.com/questions/30191232/use-stdin-from-within-r-studio
+            if (interactive()) {
+              answer <- readline(prompt = msg)
+            } else {
+              cat(msg)
+              answer <- readLines("stdin", n = 1)
+            }
+          }
+          if (answer == "n") {
+            stop(paste0(
+              "Please make sure Spark package is installed in this machine.\n",
+              " - If there is one, set the path in sparkHome parameter or ",
+              "environment variable SPARK_HOME.\n",
+              " - If not, you may run install.spark function to do the job."))
+          }
+        }
         packageLocalDir <- install.spark()
         packageLocalDir
       } else if (isClientMode(master) || deployMode == "client") {
```
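
For reference, the retry loop above can be read as a standalone helper; a minimal sketch (the `promptYesNo` name is hypothetical) of the dispatch between `readline()` for interactive shells and stdin for `Rscript`:

```
# Hypothetical helper mirroring the loop in sparkCheckInstall above.
promptYesNo <- function(msg) {
  answer <- NA
  # Keep asking until the answer is exactly "y" or "n".
  while (is.na(answer) || (answer != "y" && answer != "n")) {
    if (interactive()) {
      # readline() only works in interactive sessions (R shell, RStudio).
      answer <- readline(prompt = msg)
    } else {
      # Rscript is non-interactive; print the prompt and read a line from stdin.
      cat(msg)
      answer <- readLines("stdin", n = 1)
    }
  }
  answer
}
```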

And in the SparkR migration guide:

```
@@ -26,6 +26,10 @@ Note that this migration guide describes the items specific to SparkR.
 Many items of SQL migration can be applied when migrating SparkR to higher versions.
 Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).
 
+## Upgrading from SparkR 3.1 to 3.2
+
+- Previously, SparkR automatically downloaded and installed the Spark distribution in the user's cache directory to complete the SparkR installation when SparkR ran in a plain R shell or Rscript and the Spark distribution could not be found. Now, it asks users whether they want to download and install it. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.
+
 ## Upgrading from SparkR 2.4 to 3.0
 
 - The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.
```
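
For non-interactive scripts such as `tmp.R` above, the same opt-out can be applied at the shell level:

```
SPARKR_ASK_INSTALLATION=FALSE Rscript tmp.R
```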