[SPARK-25908][SQL][FOLLOW-UP] Add back unionAll
## What changes were proposed in this pull request?

This PR adds back `unionAll`, which is widely used. The name is also consistent with ANSI SQL, and the corresponding `intersectAll` and `exceptAll` were introduced in Spark 2.4.

## How was this patch tested?

Added a test case in DataFrameSuite.

Closes #23131 from gatorsmile/addBackUnionAll.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
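
For readers unfamiliar with the `UNION ALL` semantics that the restored method's documentation refers to (duplicates are kept, and columns are resolved by position rather than by name), here is a minimal sketch. It uses Python's standard `sqlite3` module rather than Spark, so it only illustrates the SQL behavior; the tables `a` and `b`, their rows, and the helper `union_all_demo` are hypothetical and not part of this patch.

```python
import sqlite3


def union_all_demo():
    """Illustrate UNION ALL semantics with an in-memory SQLite database.

    Hypothetical helper: the table names and rows are made up for the example.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE a (name TEXT, age INTEGER);
        CREATE TABLE b (name TEXT, age INTEGER);
        INSERT INTO a VALUES ('Michael', 30), ('Andy', 19);
        INSERT INTO b VALUES ('Michael', 30), ('Justin', 25);
        """
    )

    # UNION ALL keeps duplicate rows: ('Michael', 30) appears twice.
    keep_dups = conn.execute(
        "SELECT name, age FROM a UNION ALL SELECT name, age FROM b"
    ).fetchall()

    # A set-style union is UNION ALL followed by a distinct step.
    dedup = conn.execute(
        "SELECT DISTINCT name, age FROM "
        "(SELECT name, age FROM a UNION ALL SELECT name, age FROM b)"
    ).fetchall()

    # Columns are matched by position, not by name: swapping the column
    # order in the second SELECT puts ages into the first output column.
    by_position = conn.execute(
        "SELECT name, age FROM a UNION ALL SELECT age, name FROM b"
    ).fetchall()

    conn.close()
    return keep_dups, dedup, by_position


if __name__ == "__main__":
    keep_dups, dedup, by_position = union_all_demo()
    print(len(keep_dups), len(dedup))
    print(by_position)
```

The last query is the case to watch for: when the two inputs list the same columns in a different order, a positional union silently mixes them, which is the situation `unionByName` is meant to address.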
parent c5daccb1da
commit 94145786a5
@@ -169,6 +169,7 @@ exportMethods("arrange",
               "toJSON",
               "transform",
               "union",
+              "unionAll",
               "unionByName",
               "unique",
               "unpersist",
@@ -2732,6 +2732,20 @@ setMethod("union",
             dataFrame(unioned)
           })
 
+#' Return a new SparkDataFrame containing the union of rows
+#'
+#' This is an alias for `union`.
+#'
+#' @rdname union
+#' @name unionAll
+#' @aliases unionAll,SparkDataFrame,SparkDataFrame-method
+#' @note unionAll since 1.4.0
+setMethod("unionAll",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            union(x, y)
+          })
+
 #' Return a new SparkDataFrame containing the union of rows, matched by column names
 #'
 #' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
@@ -631,6 +631,9 @@ setGeneric("toRDD", function(x) { standardGeneric("toRDD") })
 #' @rdname union
 setGeneric("union", function(x, y) { standardGeneric("union") })
 
+#' @rdname union
+setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })
+
 #' @rdname unionByName
 setGeneric("unionByName", function(x, y) { standardGeneric("unionByName") })
 
@@ -2458,6 +2458,7 @@ test_that("union(), unionByName(), rbind(), except(), and intersect() on a DataF
   expect_equal(count(unioned), 6)
   expect_equal(first(unioned)$name, "Michael")
   expect_equal(count(arrange(suppressWarnings(union(df, df2)), df$age)), 6)
+  expect_equal(count(arrange(suppressWarnings(unionAll(df, df2)), df$age)), 6)
 
   df1 <- select(df2, "age", "name")
   unioned1 <- arrange(unionByName(df1, df), df1$age)
@@ -718,4 +718,4 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
 ## Upgrading to SparkR 3.0.0
 
 - The deprecated methods `sparkR.init`, `sparkRSQL.init`, `sparkRHive.init` have been removed. Use `sparkR.session` instead.
-- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `registerTempTable`, `createExternalTable`, `dropTempTable`, `unionAll` have been removed. Use `read.parquet`, `write.parquet`, `read.json`, `createOrReplaceTempView`, `createTable`, `dropTempView`, `union` instead.
+- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `registerTempTable`, `createExternalTable`, and `dropTempTable` have been removed. Use `read.parquet`, `write.parquet`, `read.json`, `createOrReplaceTempView`, `createTable`, `dropTempView`, `union` instead.
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+- Since Spark 3.0, the Dataset and DataFrame API `unionAll` is not deprecated any more. It is an alias for `union`.
+
 - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder comes to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
 
 - In Spark version 2.4 and earlier, the parser of JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for data types except for `StringType` and `BinaryType`.
@@ -1448,6 +1448,17 @@ class DataFrame(object):
         """
         return DataFrame(self._jdf.union(other._jdf), self.sql_ctx)
 
+    @since(1.3)
+    def unionAll(self, other):
+        """ Return a new :class:`DataFrame` containing union of rows in this and another frame.
+
+        This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
+        (that does deduplication of elements), use this function followed by :func:`distinct`.
+
+        Also as standard in SQL, this function resolves columns by position (not by name).
+        """
+        return self.union(other)
+
     @since(2.3)
     def unionByName(self, other):
         """ Returns a new :class:`DataFrame` containing union of rows in this and another frame.
@@ -1852,6 +1852,20 @@ class Dataset[T] private[sql](
     CombineUnions(Union(logicalPlan, other.logicalPlan))
   }
 
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
+   * This is an alias for `union`.
+   *
+   * This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union (that does
+   * deduplication of elements), use this function followed by a [[distinct]].
+   *
+   * Also as standard in SQL, this function resolves columns by position (not by name).
+   *
+   * @group typedrel
+   * @since 2.0.0
+   */
+  def unionAll(other: Dataset[T]): Dataset[T] = union(other)
+
   /**
    * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
    *
@@ -97,6 +97,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       unionDF.agg(avg('key), max('key), min('key), sum('key)),
       Row(50.5, 100, 1, 25250) :: Nil
     )
+
+    // unionAll is an alias of union
+    val unionAllDF = testData.unionAll(testData).unionAll(testData)
+      .unionAll(testData).unionAll(testData)
+
+    checkAnswer(unionDF, unionAllDF)
   }
 
   test("union should union DataFrames with UDTs (SPARK-13410)") {