---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}
Note that this migration guide describes the items specific to SparkR.
Many items of the SQL migration guide also apply when migrating SparkR to higher versions.
Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).
## Upgrading from SparkR 2.4 to 3.0

- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.
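For example, code that used the removed readers can switch to the generic `read.*`/`write.*` methods. A minimal sketch, assuming a running SparkR session; the paths are illustrative:

```r
# Before (removed in 3.0):
# df <- parquetFile("/tmp/people.parquet")
df <- read.parquet("/tmp/people.parquet")

# df2 <- jsonFile("/tmp/people.json")
df2 <- read.json("/tmp/people.json")

# saveAsParquetFile(df, "/tmp/out")
write.parquet(df, "/tmp/out")
```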
## Upgrading from SparkR 2.3 to 2.4

- Previously, the validity of the size of the last layer in `spark.mlp` was not checked. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not cause an error previously, but now it does.
## Upgrading from SparkR 2.3 to 2.3.1 and above

- In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
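The fixed behaviour matches base R's `substr`. A minimal sketch, assuming a running SparkR session:

```r
df <- createDataFrame(data.frame(s = "abcdef"))
# In SparkR 2.3.1 and later, start is 1-based, as in base R:
collect(select(df, substr(df$s, 2, 4)))  # "bcd", matching substr("abcdef", 2, 4)
```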
## Upgrading from SparkR 2.2 to 2.3

- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
- For `summary`, an option for the statistics to compute has been added. Its output differs from that of `describe`.
- A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
## Upgrading from SparkR 2.1 to 2.2

- A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
- The method `createExternalTable` has been deprecated and replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
- By default, `derby.log` is now saved to `tempdir()`. This is created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
- `spark.lda` was not setting the optimizer correctly. It has been corrected.
- Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
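Migrating from the deprecated method is a rename; `createTable` accepts the same table name, path, and source. A minimal sketch, assuming a running SparkR session; the table name and path are illustrative:

```r
# Before (deprecated in 2.2):
# df <- createExternalTable("people", path = "/tmp/people", source = "parquet")
# After:
df <- createTable("people", path = "/tmp/people", source = "parquet")
```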
## Upgrading from SparkR 2.0 to 2.1

- `join` no longer performs a Cartesian product by default; use `crossJoin` instead.
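A join with no join expression now requires an explicit cross join. A minimal sketch, assuming a running SparkR session and two illustrative DataFrames:

```r
df1 <- createDataFrame(data.frame(x = 1:2))
df2 <- createDataFrame(data.frame(y = c("a", "b")))

# Before 2.1, join(df1, df2) silently produced a Cartesian product;
# now the cross join must be requested explicitly:
count(crossJoin(df1, df2))  # 4 rows
```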
## Upgrading from SparkR 1.6 to 2.0

- The method `table` has been removed and replaced by `tableToDF`.
- The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
- Spark's `SQLContext` and `HiveContext` have been deprecated and replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` to instantiate the SparkSession. Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
- The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix "spark.executorEnv.VAR_NAME", for example, "spark.executorEnv.PATH".
- The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
- The method `registerTempTable` has been deprecated and replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated and replaced by `dropTempView`.
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
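Taken together, the session and temp-view changes look like this. A minimal sketch, assuming a local Spark installation and using the built-in `faithful` dataset:

```r
library(SparkR)
sparkR.session()                         # replaces sparkR.init()/sparkRSQL.init()

df <- createDataFrame(faithful)          # no sqlContext argument needed
createOrReplaceTempView(df, "faithful")  # replaces registerTempTable
head(sql("SELECT * FROM faithful WHERE waiting > 70"))
dropTempView("faithful")                 # replaces dropTempTable
```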
## Upgrading from SparkR 1.5 to 1.6

- Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API.
- SparkSQL converts `NA` in R to `null` and vice versa.
- Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to, or replacing an existing column of the same name in, a DataFrame.
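To keep the pre-1.6.0 write behaviour, the mode must now be passed explicitly. A minimal sketch, assuming a SparkDataFrame `df` and an illustrative output path:

```r
# The default mode is now "error": writing to an existing path fails.
# Request the old appending behaviour explicitly:
write.df(df, path = "/tmp/out", source = "parquet", mode = "append")
```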