2793347972
### What changes were proposed in this pull request? 1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`). 2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ |a | +---------------------------------+ |[[1, 2, 3], [[4,, 6], [7, 8, 9]]]| +---------------------------------+ ``` Currently, to drop the missing value users would have to do something like this: ``` val result = data.withColumn("a", struct( $"a.a", struct( struct( $"a.b.a.a", $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ |a | +---------------------------------------------------------------+ |[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this to get the same result: ``` val result = data.withColumn("a", 'a.dropFields("b.a.b")) result.show(false) +---------------------------------------------------------------+ |a | +---------------------------------------------------------------+ |[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]| +---------------------------------------------------------------+ ``` This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`. However, this should be added in a separate PR. ### Does this PR introduce _any_ user-facing change? The documentation for `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Column directly. ### How was this patch tested? New unit tests were added. Jenkins must pass them. ### Related JIRAs: More discussion on this topic can be found here: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try. Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
---|---|---|
.. | ||
AggregateBenchmark-jdk11-results.txt | ||
AggregateBenchmark-results.txt | ||
BloomFilterBenchmark-jdk11-results.txt | ||
BloomFilterBenchmark-results.txt | ||
BuiltInDataSourceWriteBenchmark-jdk11-results.txt | ||
BuiltInDataSourceWriteBenchmark-results.txt | ||
ColumnarBatchBenchmark-jdk11-results.txt | ||
ColumnarBatchBenchmark-results.txt | ||
CompressionSchemeBenchmark-jdk11-results.txt | ||
CompressionSchemeBenchmark-results.txt | ||
CSVBenchmark-jdk11-results.txt | ||
CSVBenchmark-results.txt | ||
DatasetBenchmark-jdk11-results.txt | ||
DatasetBenchmark-results.txt | ||
DataSourceReadBenchmark-jdk11-results.txt | ||
DataSourceReadBenchmark-results.txt | ||
DateTimeBenchmark-jdk11-results.txt | ||
DateTimeBenchmark-results.txt | ||
DateTimeRebaseBenchmark-jdk11-results.txt | ||
DateTimeRebaseBenchmark-results.txt | ||
ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk11-results.txt | ||
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt | ||
ExtractBenchmark-jdk11-results.txt | ||
ExtractBenchmark-results.txt | ||
FilterPushdownBenchmark-jdk11-results.txt | ||
FilterPushdownBenchmark-results.txt | ||
HashedRelationMetricsBenchmark-jdk11-results.txt | ||
HashedRelationMetricsBenchmark-results.txt | ||
InExpressionBenchmark-jdk11-results.txt | ||
InExpressionBenchmark-results.txt | ||
IntervalBenchmark-jdk11-results.txt | ||
IntervalBenchmark-results.txt | ||
JoinBenchmark-jdk11-results.txt | ||
JoinBenchmark-results.txt | ||
JsonBenchmark-jdk11-results.txt | ||
JsonBenchmark-results.txt | ||
MakeDateTimeBenchmark-jdk11-results.txt | ||
MakeDateTimeBenchmark-results.txt | ||
MetricsAggregationBenchmark-jdk11-results.txt | ||
MetricsAggregationBenchmark-results.txt | ||
MiscBenchmark-jdk11-results.txt | ||
MiscBenchmark-results.txt | ||
OrcNestedSchemaPruningBenchmark-jdk11-results.txt | ||
OrcNestedSchemaPruningBenchmark-results.txt | ||
OrcV2NestedSchemaPruningBenchmark-jdk11-results.txt | ||
OrcV2NestedSchemaPruningBenchmark-results.txt | ||
ParquetNestedPredicatePushDownBenchmark-jdk11-results.txt | ||
ParquetNestedPredicatePushDownBenchmark-results.txt | ||
ParquetNestedSchemaPruningBenchmark-jdk11-results.txt | ||
ParquetNestedSchemaPruningBenchmark-results.txt | ||
PrimitiveArrayBenchmark-jdk11-results.txt | ||
PrimitiveArrayBenchmark-results.txt | ||
RangeBenchmark-jdk11-results.txt | ||
RangeBenchmark-results.txt | ||
SortBenchmark-jdk11-results.txt | ||
SortBenchmark-results.txt | ||
TPCDSQueryBenchmark-jdk11-results.txt | ||
TPCDSQueryBenchmark-results.txt | ||
UDFBenchmark-jdk11-results.txt | ||
UDFBenchmark-results.txt | ||
UnsafeArrayDataBenchmark-jdk11-results.txt | ||
UnsafeArrayDataBenchmark-results.txt | ||
UpdateFieldsBenchmark-results.txt | ||
WideSchemaBenchmark-jdk11-results.txt | ||
WideSchemaBenchmark-results.txt | ||
WideTableBenchmark-jdk11-results.txt | ||
WideTableBenchmark-results.txt |