spark-instrumented-optimizer/sql/core/benchmarks
fqaiser94@gmail.com 2793347972 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`).
2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

The documentation for `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Column directly.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:53:30 +00:00
..
AggregateBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
AggregateBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
BloomFilterBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
BloomFilterBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
BuiltInDataSourceWriteBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
BuiltInDataSourceWriteBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
ColumnarBatchBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
ColumnarBatchBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
CompressionSchemeBenchmark-jdk11-results.txt [SPARK-32802][SQL] Avoid using SpecificInternalRow in RunLengthEncoding#Encoder 2020-09-12 22:19:30 -07:00
CompressionSchemeBenchmark-results.txt [SPARK-32802][SQL] Avoid using SpecificInternalRow in RunLengthEncoding#Encoder 2020-09-12 22:19:30 -07:00
CSVBenchmark-jdk11-results.txt [SPARK-30648][SQL] Support filters pushdown in JSON datasource 2020-07-17 00:01:13 +09:00
CSVBenchmark-results.txt [SPARK-30648][SQL] Support filters pushdown in JSON datasource 2020-07-17 00:01:13 +09:00
DatasetBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
DatasetBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
DataSourceReadBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
DataSourceReadBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
DateTimeBenchmark-jdk11-results.txt [SPARK-32006][SQL] Create date/timestamp formatters once before collect in hiveResultString() 2020-06-17 06:28:47 +00:00
DateTimeBenchmark-results.txt [SPARK-32006][SQL] Create date/timestamp formatters once before collect in hiveResultString() 2020-06-17 06:28:47 +00:00
DateTimeRebaseBenchmark-jdk11-results.txt [SPARK-31992][SQL] Benchmark the EXCEPTION rebase mode 2020-06-15 07:25:56 +00:00
DateTimeRebaseBenchmark-results.txt [SPARK-31992][SQL] Benchmark the EXCEPTION rebase mode 2020-06-15 07:25:56 +00:00
ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
ExtractBenchmark-jdk11-results.txt [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function 2020-04-22 10:24:49 +00:00
ExtractBenchmark-results.txt [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function 2020-04-22 10:24:49 +00:00
FilterPushdownBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
FilterPushdownBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
HashedRelationMetricsBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
HashedRelationMetricsBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
InExpressionBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
InExpressionBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
IntervalBenchmark-jdk11-results.txt [SPARK-32071][SQL][TESTS] Add make_interval benchmark 2020-06-27 17:54:06 -07:00
IntervalBenchmark-results.txt [SPARK-32071][SQL][TESTS] Add make_interval benchmark 2020-06-27 17:54:06 -07:00
JoinBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
JoinBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
JsonBenchmark-jdk11-results.txt [SPARK-30648][SQL] Support filters pushdown in JSON datasource 2020-07-17 00:01:13 +09:00
JsonBenchmark-results.txt [SPARK-30648][SQL] Support filters pushdown in JSON datasource 2020-07-17 00:01:13 +09:00
MakeDateTimeBenchmark-jdk11-results.txt [SPARK-32072][CORE][TESTS] Fix table formatting with benchmark results 2020-06-24 04:43:53 +00:00
MakeDateTimeBenchmark-results.txt [SPARK-32072][CORE][TESTS] Fix table formatting with benchmark results 2020-06-24 04:43:53 +00:00
MetricsAggregationBenchmark-jdk11-results.txt [SPARK-29562][SQL] Speed up and slim down metric aggregation in SQL listener 2019-10-24 22:18:10 -07:00
MetricsAggregationBenchmark-results.txt [SPARK-29562][SQL] Speed up and slim down metric aggregation in SQL listener 2019-10-24 22:18:10 -07:00
MiscBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
MiscBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
OrcNestedSchemaPruningBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
OrcNestedSchemaPruningBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
OrcV2NestedSchemaPruningBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
OrcV2NestedSchemaPruningBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
ParquetNestedPredicatePushDownBenchmark-jdk11-results.txt [SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown 2020-04-24 22:10:58 +00:00
ParquetNestedPredicatePushDownBenchmark-results.txt [SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown 2020-04-24 22:10:58 +00:00
ParquetNestedSchemaPruningBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
ParquetNestedSchemaPruningBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
PrimitiveArrayBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
PrimitiveArrayBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
RangeBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
RangeBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
SortBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
SortBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
TPCDSQueryBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
TPCDSQueryBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
UDFBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
UDFBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
UnsafeArrayDataBenchmark-jdk11-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
UnsafeArrayDataBenchmark-results.txt [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) 2019-10-03 08:58:25 -07:00
UpdateFieldsBenchmark-results.txt [SPARK-32511][SQL] Add dropFields method to Column class 2020-10-06 08:53:30 +00:00
WideSchemaBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
WideSchemaBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
WideTableBenchmark-jdk11-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00
WideTableBenchmark-results.txt [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks 2020-01-12 13:18:19 -08:00