spark-instrumented-optimizer/external
Huaxin Gao 38b6fbd9b8 [SPARK-36351][SQL] Refactor filter push down in file source v2
### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to the file source and then pass all of them (partition filters + data filters) as post-scan filters to the v2 `Scan`. Later, in `PruneFileSourcePartitions`, we separate the partition filters from the data filters and set them, as `Expression`s, on the file source.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about the partition columns, we separate the partition filters from the data filters right there (a sketch of the separation predicate follows below).
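Conceptually, the separation needs only the partition schema: a filter counts as a partition filter when every column it references is a partition column (and it contains no subquery, since subqueries can't be evaluated during pruning). A minimal sketch of that predicate; the helper name is hypothetical and this is not the exact code in this PR:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: split the pushed filters into (partition filters, data filters).
// A filter is a partition filter only if it contains no subquery and every
// column it references is a partition column.
def splitPartitionAndDataFilters(
    partitionSchema: StructType,
    filters: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {
  val partitionColumns = partitionSchema.fieldNames.map(_.toLowerCase).toSet
  filters.partition { filter =>
    !SubqueryExpression.hasSubquery(filter) &&
      filter.references.forall(r => partitionColumns.contains(r.name.toLowerCase))
  }
}
```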

The benefits of doing this:
- We can handle all the filter-related work for v2 file sources in one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code is cleaner and easier to maintain.
- We actually have to separate partition filters and data filters in `V2ScanRelationPushDown`; otherwise there is no way to tell which filters are partition filters, and we can't push down aggregates for Parquet even when we only have partition filters.
- By separating the filters early in `V2ScanRelationPushDown`, we only need to check the data filters to find out which ones should be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we check all the filters (both partition filters and data filters).
- Similarly, we can pass only the data filters as post-scan filters to the v2 `Scan`, because partition filters are used for partition pruning only and need not be passed as post-scan filters.

To do this, this PR makes the following changes:

- Add `pushFilters` to file source v2 (see the sketch after this list). In this method:
    - Push both the `Expression` partition filters and the `Expression` data filters to the file source. We have to use `Expression` filters because we need them for partition pruning.
    - Data filters are used for filter push down. If the file source needs to push down data filters, it translates them from `Expression` to `sources.Filter` and then decides which filters to push down.
    - Partition filters are used for partition pruning.
- File source v2 no longer needs to implement `SupportsPushDownFilters`, because by the time we separate the two types of filters we have already set them on the file data sources; using `SupportsPushDownFilters` to set the filters again would be redundant.
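A rough sketch of the shape described above; the class and member names here are illustrative assumptions, not the exact code merged by this PR:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.DataSourceStrategy
import org.apache.spark.sql.sources.Filter

// Illustrative scan-builder base class for file sources.
abstract class FileScanBuilderSketch {
  protected var partitionFilters: Seq[Expression] = Seq.empty
  protected var dataFilters: Seq[Expression] = Seq.empty
  protected var pushedDataFilters: Array[Filter] = Array.empty

  // Called from V2ScanRelationPushDown with the filters already separated.
  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
    // Partition filters stay as Expressions; they are only used for partition pruning.
    this.partitionFilters = partitionFilters
    this.dataFilters = dataFilters
    // Only data filters are candidates for source-level push down: translate
    // each one from Expression to sources.Filter where a translation exists.
    val translated = dataFilters.flatMap { e =>
      DataSourceStrategy.translateFilter(e, supportNestedPredicatePushdown = true)
    }
    this.pushedDataFilters = pushDataFilters(translated.toArray)
  }

  // Each format (Parquet, ORC, ...) decides which translated filters it can
  // actually push down; by default, nothing is pushed.
  protected def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] = Array.empty
}
```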

### Why are the changes needed?

See the first section above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-02 19:11:43 -07:00
| Directory | Last commit | Date |
| --- | --- | --- |
| avro | [SPARK-36351][SQL] Refactor filter push down in file source v2 | 2021-09-02 19:11:43 -07:00 |
| docker | [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff | 2020-07-17 12:05:45 -07:00 |
| docker-integration-tests | [SPARK-36573][BUILD][TEST] Add a default value to ORACLE_DOCKER_IMAGE | 2021-08-24 13:30:21 -07:00 |
| kafka-0-10 | [SPARK-36410][CORE][SQL][STRUCTURED STREAMING][EXAMPLES] Replace anonymous classes with lambda expressions | 2021-08-09 19:28:31 +09:00 |
| kafka-0-10-assembly | [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT | 2021-07-02 13:47:36 -07:00 |
| kafka-0-10-sql | [SPARK-36576][SS] Improve range split calculation for Kafka Source minPartitions option | 2021-08-29 16:38:29 +09:00 |
| kafka-0-10-token-provider | [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT | 2021-07-02 13:47:36 -07:00 |
| kinesis-asl | [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT | 2021-07-02 13:47:36 -07:00 |
| kinesis-asl-assembly | [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT | 2021-07-02 13:47:36 -07:00 |
| spark-ganglia-lgpl | [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT | 2021-07-02 13:47:36 -07:00 |