e9f2e34261
### What changes were proposed in this pull request?
This PR proposes to ask users if they want to download and install Spark when they install SparkR from CRAN.
The `SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
### Why are the changes needed?
This is required for CRAN. SparkR is currently removed from CRAN: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E
### Does this PR introduce _any_ user-facing change?
Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package when they are in a plain R shell or `Rscript`.
### How was this patch tested?
**R shell**
Valid input (`n`):
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```
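As the error message suggests, the download can also be done explicitly up front with SparkR's `install.spark`; a minimal sketch (cache locations and downloaded versions vary by machine):
```
library(SparkR)

# Download (or reuse) the Spark distribution in the local cache directory,
# then start the session without being prompted interactively.
install.spark()
sparkR.session(master = "local")
```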
Invalid input:
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```
**Rscript**
```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```
```
Rscript tmp.R
```
Valid input (`n`):
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```
Invalid input:
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```
`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).
Closes #33887 from HyukjinKwon/SPARK-36631.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
* Table of contents
{:toc}
Note that this migration guide describes the items specific to SparkR. Many items of the SQL migration can be applied when migrating SparkR to higher versions. Please refer to Migration Guide: SQL, Datasets and DataFrame.
## Upgrading from SparkR 3.1 to 3.2

 - Previously, SparkR automatically downloaded and installed the Spark distribution in the user's cache directory to complete the SparkR installation when SparkR runs in a plain R shell or `Rscript` and the Spark distribution cannot be found. Now, it asks users whether they want to download and install it. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.
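For example, a minimal sketch of restoring the non-interactive behavior from within R (the variable can equally be set in the shell before launching R):
```
library(SparkR)

# Disable the interactive question; SparkR falls back to downloading and
# installing the Spark distribution automatically when it is not found.
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")
sparkR.session(master = "local")
```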
## Upgrading from SparkR 2.4 to 3.0

 - The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.
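For instance, a read that used the removed helpers can be rewritten with the current API; a short sketch (file paths are placeholders):
```
library(SparkR)
sparkR.session()

# Old (removed): df <- jsonFile(sqlContext, "people.json")
df <- read.json("people.json")

# Old (removed): users <- parquetFile(sqlContext, "users.parquet")
users <- read.parquet("users.parquet")
write.parquet(users, "users_copy.parquet")
```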
## Upgrading from SparkR 2.3 to 2.4

 - Previously, we did not check the validity of the size of the last layer in `spark.mlp`. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not cause an error; now it does.
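A sketch of the stricter check, assuming the three-class `iris` data set (four features, three labels): the last element of `layers` must match the number of label classes.
```
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# OK: 4 input features, one hidden layer of 5, 3 output classes.
model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3))

# Raises an error in 2.4+: the last layer (2) does not match the 3 labels.
# bad_model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 2))
```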
## Upgrading from SparkR 2.3 to 2.3.1 and above

 - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
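A small sketch of the 1-based behaviour on a Column expression:
```
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(dummy = 1))

# start is 1-based in 2.3.1+, matching base R's substr:
# this yields "bcd" (SparkR 2.3.0 would have yielded "abc").
collect(select(df, substr(lit("abcdef"), 2, 4)))
```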
## Upgrading from SparkR 2.2 to 2.3

 - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected (see the sketch after this list).
 - For `summary`, an option specifying the statistics to compute has been added. Its output is changed from that of `describe`.
 - A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
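For example, a sketch of the corrected `collect` behaviour and the extended `summary`:
```
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# stringsAsFactors is now honored: Species comes back as a factor.
local_df <- collect(df, stringsAsFactors = TRUE)
str(local_df$Species)

# summary now accepts the statistics to compute; its output differs from describe.
showDF(summary(df, "count", "min", "max"))
```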
## Upgrading from SparkR 2.1 to 2.2

 - A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame` (see the sketch after this list). When splitting the data, the partition position calculation has been made to match the one in Scala.
 - The method `createExternalTable` has been deprecated to be replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
 - By default, `derby.log` is now saved to `tempdir()`. This will be created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
 - `spark.lda` was not setting the optimizer correctly. It has been corrected.
 - Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
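A brief sketch of the added `numPartitions` parameter (the table example assumes a hypothetical Parquet path):
```
library(SparkR)
sparkR.session()

# numPartitions controls how the local data is split across partitions.
df <- createDataFrame(iris, numPartitions = 4)

# createTable replaces the deprecated createExternalTable; with a path it
# creates an external table over the existing files.
# tbl <- createTable("iris_tbl", path = "/tmp/iris_parquet", source = "parquet")
```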
## Upgrading from SparkR 2.0 to 3.1

 - `join` no longer performs Cartesian Product by default; use `crossJoin` instead.
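A sketch of requesting the Cartesian product explicitly:
```
library(SparkR)
sparkR.session()

df1 <- createDataFrame(data.frame(x = 1:3))
df2 <- createDataFrame(data.frame(y = c("a", "b")))

# join(df1, df2) without a join expression no longer produces a Cartesian
# product; ask for it explicitly instead.
head(crossJoin(df1, df2))
```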
## Upgrading from SparkR 1.6 to 2.0

 - The method `table` has been removed and replaced by `tableToDF`.
 - The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
 - Spark's `SQLContext` and `HiveContext` have been deprecated to be replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` in its place to instantiate the SparkSession (see the sketch after this list). Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
 - The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix "spark.executorEnv.VAR_NAME", for example, "spark.executorEnv.PATH".
 - The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
 - The method `registerTempTable` has been deprecated to be replaced by `createOrReplaceTempView`.
 - The method `dropTempTable` has been deprecated to be replaced by `dropTempView`.
 - The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
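A sketch of the 2.0-style session setup and the renamed temp-view methods:
```
library(SparkR)

# sparkR.session() replaces sparkR.init(); the active SparkSession is then
# used implicitly by SparkDataFrame operations (no sqlContext argument).
sparkR.session(master = "local", appName = "migration-example")

df <- as.DataFrame(faithful)
createOrReplaceTempView(df, "faithful_view")   # replaces registerTempTable
head(sql("SELECT * FROM faithful_view LIMIT 3"))
```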
## Upgrading from SparkR 1.5 to 1.6

 - Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API (see the sketch after this list).
 - SparkSQL converts `NA` in R to `null` and vice-versa.
 - Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to, or replacing an existing column of the same name in, a DataFrame.