---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

Note that this migration guide describes the items specific to SparkR.
Many items in the SQL migration guide also apply when migrating SparkR to higher versions.
Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).

## Upgrading from SparkR 2.4 to 3.0

- The deprecated methods `sparkR.init`, `sparkRSQL.init`, and `sparkRHive.init` have been removed. Use `sparkR.session` instead.
- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `registerTempTable`, `createExternalTable`, and `dropTempTable` have been removed. Use `read.parquet`, `write.parquet`, `read.json`, `createOrReplaceTempView`, `createTable`, `dropTempView`, `union` instead.
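A minimal sketch of what this migration looks like in practice, assuming a local Spark installation (the parquet path is illustrative):

```r
library(SparkR)

# Before 3.0: sparkR.init() / sparkRSQL.init(); now a single entry point
sparkR.session(appName = "migration-example")

# Before 3.0: parquetFile(path); now:
df <- read.parquet("/tmp/example.parquet")

# Before 3.0: registerTempTable(df, "t"); now:
createOrReplaceTempView(df, "t")
```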

## Upgrading from SparkR 2.3 to 2.4

- Previously, the validity of the size of the last layer in `spark.mlp` was not checked. For example, if the training data has only two labels, a `layers` param like `c(1, 3)` did not cause an error before; now it does.
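A sketch of a valid configuration under the new check, assuming a running Spark session (dataset and layer sizes are illustrative):

```r
df <- createDataFrame(iris)  # Species has three classes

# The last element of `layers` must match the number of output classes (3 here);
# since 2.4, a mismatch such as layers = c(4, 5, 2) raises an error.
model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3), maxIter = 10)
```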

## Upgrading from SparkR 2.3 to 2.3.1 and above

- In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one, i.e. treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, while the result is `bcd` in SparkR 2.3.1.
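A small sketch showing the corrected, R-consistent behaviour, assuming a running Spark session:

```r
df <- createDataFrame(data.frame(s = "abcdef"))

# Since 2.3.1, start is 1-based, matching base R:
# substr("abcdef", 2, 4) in base R also gives "bcd"
head(select(df, substr(df$s, 2, 4)))  # "bcd"
```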

## Upgrading from SparkR 2.2 to 2.3

- The `stringsAsFactors` parameter was previously ignored by `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
- For `summary`, an option for the statistics to compute has been added. Its output differs from that of `describe`.
- A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
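A sketch of the extended `summary` call, assuming a running Spark session (the chosen statistics are illustrative):

```r
df <- createDataFrame(faithful)

# summary now accepts the statistics to compute, including percentiles;
# with no arguments it computes a default set.
head(summary(df, "min", "25%", "75%", "max"))
```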

## Upgrading from SparkR 2.1 to 2.2

- A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
- The method `createExternalTable` has been deprecated and replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
- By default, `derby.log` is now saved to `tempdir()`. It is created when the SparkSession is instantiated with `enableHiveSupport` set to `TRUE`.
- `spark.lda` was not setting the optimizer correctly. It has been corrected.
- Several model summary outputs have been updated to expose `coefficients` as a `matrix`. This includes `spark.logit`, `spark.kmeans`, and `spark.glm`. Model summary outputs for `spark.gaussianMixture` now include the log-likelihood as `loglik`.
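A sketch of the new `numPartitions` parameter, assuming a running Spark session:

```r
# numPartitions controls how the local R data is split across partitions
df <- createDataFrame(iris, numPartitions = 4)
getNumPartitions(df)
```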

## Upgrading from SparkR 2.0 to 2.1

- `join` no longer performs a Cartesian product by default; use `crossJoin` instead.
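A sketch of the explicit Cartesian product, assuming a running Spark session:

```r
df1 <- createDataFrame(data.frame(x = 1:2))
df2 <- createDataFrame(data.frame(y = c("a", "b")))

# A Cartesian product must now be requested explicitly:
head(crossJoin(df1, df2))  # 4 rows: every x paired with every y
```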

## Upgrading from SparkR 1.6 to 2.0

- The method `table` has been removed and replaced by `tableToDF`.
- The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
- Spark's `SQLContext` and `HiveContext` have been deprecated and replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` to instantiate the SparkSession. Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
- The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix `spark.executorEnv.VAR_NAME`, for example, `spark.executorEnv.PATH`.
- The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
- The method `registerTempTable` has been deprecated and replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated and replaced by `dropTempView`.
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
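A sketch of the 2.0-style session setup, assuming a local Spark installation (the `PATH` value is illustrative):

```r
# sparkR.init() is replaced by sparkR.session(); executor environment
# variables go through spark.executorEnv.* config properties.
sparkR.session(
  appName = "migrated-app",
  sparkConfig = list("spark.executorEnv.PATH" = "/usr/local/bin:/usr/bin")
)

df <- createDataFrame(mtcars)        # no sqlContext argument needed
createOrReplaceTempView(df, "cars")  # replaces registerTempTable
```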

## Upgrading from SparkR 1.5 to 1.6

- Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API.
- SparkSQL converts `NA` in R to `null` and vice-versa.
- Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to a DataFrame or replacing an existing column of the same name.