spark-instrumented-optimizer


main	[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array	2019-03-16 14:30:42 -05:00
test	[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array	2019-03-16 14:30:42 -05:00

Dilip Biswal aea9a574c4 [SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array

## What changes were proposed in this pull request?
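Correct the logic that `array_distinct` uses to compute the distinct elements, so that it also works on columns containing an array of arrays.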

Below is a small repro snippet.

```
scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col")
df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>]

scala> val distinctDF = df.select(array_distinct(col("array_col")))
distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>]

scala> df.show(false)
+----------------------------------------+
|array_col                               |
+----------------------------------------+
|[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]|
+----------------------------------------+
```
Actual result before this change (incorrect):
```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [1, 2], [1, 2]] |
+-------------------------+
```
Expected result:
```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [3, 4], [4, 5]] |
+-------------------------+
```
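As a rough cross-check of the expected result, the same input can be deduplicated with plain Scala collections, which compare nested `Seq` values structurally. This is only a sketch of the intended semantics, not the code changed by this patch; the object `ArrayDistinctSketch` and the helper `expectedDistinct` are hypothetical names introduced for illustration.

```scala
// Plain Scala collections only; no Spark session is required.
// ArrayDistinctSketch and expectedDistinct are hypothetical names,
// not part of Spark or of this patch.
object ArrayDistinctSketch {
  // Order-preserving distinct over nested sequences.
  // Seq equality is structural, so Seq(1, 2) == Seq(1, 2) holds.
  def expectedDistinct(rows: Seq[Seq[Int]]): Seq[Seq[Int]] = rows.distinct

  def main(args: Array[String]): Unit = {
    // Same values as the array_col row in the repro above.
    val arrayCol = Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
    println(expectedDistinct(arrayCol))
    // prints: List(List(1, 2), List(3, 4), List(4, 5))
  }
}
```

The first occurrence of each inner array is kept and the input order is preserved, which matches the expected `[[1, 2], [3, 4], [4, 5]]` shown above.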
## How was this patch tested?
Added an additional test.

Closes #24073 from dilipbiswal/SPARK-27134.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

2019-03-16 14:30:42 -05:00
