[SPARK-32792][SQL] Improve Parquet In filter pushdown
...
### What changes were proposed in this pull request?
Support pushing down `GreaterThanOrEqual` on the minimum value and `LessThanOrEqual` on the maximum value for Parquet when the number of values in a [sources.In](a744fea3be/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala#L162-L181) filter exceeds `spark.sql.optimizer.inSetRewriteMinMaxThreshold`. For example:
```sql
SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)
```
We will push down `id >= 1 and id <= 15`.
Impala also has this improvement: https://issues.apache.org/jira/browse/IMPALA-3654
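The rewrite described above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: the filter classes and `rewrite_in_for_pushdown` are hypothetical stand-ins, the `threshold=10` value is assumed so that the 13-value example exceeds it, and the handling of short lists (one equality test per value) is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple, Union


# Hypothetical stand-ins for data source filters; these are not Spark's classes.
@dataclass(frozen=True)
class EqualTo:
    column: str
    value: int


@dataclass(frozen=True)
class GreaterThanOrEqual:
    column: str
    value: int


@dataclass(frozen=True)
class LessThanOrEqual:
    column: str
    value: int


@dataclass(frozen=True)
class Or:
    children: Tuple[EqualTo, ...]


@dataclass(frozen=True)
class And:
    left: GreaterThanOrEqual
    right: LessThanOrEqual


def rewrite_in_for_pushdown(
    column: str, values: Sequence[int], threshold: int
) -> Optional[Union[Or, And]]:
    """Return the filter to push down for `column IN (values)`."""
    distinct = sorted(set(values))
    if not distinct:
        return None  # nothing to push down
    if len(distinct) <= threshold:
        # Assumed behavior for short lists: one equality test per value.
        return Or(tuple(EqualTo(column, v) for v in distinct))
    # Long lists: push down only the enclosing range [min, max].
    return And(
        GreaterThanOrEqual(column, distinct[0]),
        LessThanOrEqual(column, distinct[-1]),
    )


# threshold=10 is an assumed value, chosen so the 13-value example exceeds it.
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15]
pushed = rewrite_in_for_pushdown("id", ids, threshold=10)
print(pushed)
```

The pushed-down range is wider than the original list (it also admits 13 and 14 here), so in this sketch the original `IN` predicate would still have to be applied to the rows that come back.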
### Why are the changes needed?
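Improve query performance. In the benchmark below, pushdown for `In` lists of 100 and 2000 values goes from no measurable gain (1.0X) to between 1.1X and 6.3X, depending on the distribution.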
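One way to see where such a gain can come from, assuming the reader uses per-row-group min/max statistics to skip row groups that cannot match a range filter. The function name and the statistics below are made up for illustration.

```python
from typing import List, Sequence, Tuple


def row_groups_to_read(
    stats: Sequence[Tuple[int, int]], lower: int, upper: int
) -> List[int]:
    """Indexes of row groups whose [min, max] range overlaps [lower, upper]."""
    return [
        i
        for i, (rg_min, rg_max) in enumerate(stats)
        if rg_max >= lower and rg_min <= upper
    ]


# Made-up (min, max) statistics of `id` for four row groups.
stats = [(0, 99), (100, 199), (200, 299), (300, 399)]

print(row_groups_to_read(stats, 1, 15))   # narrow range: [0]
print(row_groups_to_read(stats, 1, 350))  # wide range: [0, 1, 2, 3]
```

Under this assumption, the benefit depends on how much of the data the `[min, max]` range covers: a narrow range skips most row groups, while a range spanning nearly all values skips almost none.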
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests, a [manual test](https://github.com/apache/spark/pull/29642#issuecomment-743109098), and a benchmark test.
Before this PR:
```
================================================================================================
Pushdown benchmark for InSet -> InFilters
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5995           6026          53          2.6         381.2       1.0X
Parquet Vectorized (Pushdown)                                       423            440          11         37.2          26.9      14.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5767           5887         154          2.7         366.7       1.0X
Parquet Vectorized (Pushdown)                                       419            428           6         37.6          26.6      13.8X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5764           5857          96          2.7         366.4       1.0X
Parquet Vectorized (Pushdown)                                       408            419           9         38.6          25.9      14.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5895           5949          41          2.7         374.8       1.0X
Parquet Vectorized (Pushdown)                                       5908           5986         114          2.7         375.6       1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5893           5988         106          2.7         374.7       1.0X
Parquet Vectorized (Pushdown)                                       5875           5939          57          2.7         373.5       1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5891           5954          42          2.7         374.5       1.0X
Parquet Vectorized (Pushdown)                                       5901           5976          99          2.7         375.2       1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6128           6158          40          2.6         389.6       1.0X
Parquet Vectorized (Pushdown)                                        6145           6190          37          2.6         390.7       1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6142           6217          64          2.6         390.5       1.0X
Parquet Vectorized (Pushdown)                                        6149           6235          90          2.6         391.0       1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6148           6218          64          2.6         390.9       1.0X
Parquet Vectorized (Pushdown)                                        6145           6177          30          2.6         390.7       1.0X
```
After this PR:
```
================================================================================================
Pushdown benchmark for InSet -> InFilters
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5745           5768          28          2.7         365.2       1.0X
Parquet Vectorized (Pushdown)                                       401            412          12         39.2          25.5      14.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5796           5861          61          2.7         368.5       1.0X
Parquet Vectorized (Pushdown)                                       417            482          37         37.7          26.5      13.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5754           5777          20          2.7         365.8       1.0X
Parquet Vectorized (Pushdown)                                       408            418           9         38.6          25.9      14.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5878           5915          40          2.7         373.7       1.0X
Parquet Vectorized (Pushdown)                                        929            940          10         16.9          59.1       6.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5886           5917          29          2.7         374.2       1.0X
Parquet Vectorized (Pushdown)                                       3091           3114          20          5.1         196.5       1.9X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 100, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  5913           5948          48          2.7         375.9       1.0X
Parquet Vectorized (Pushdown)                                       5330           5427          98          3.0         338.9       1.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6147           6228          72          2.6         390.8       1.0X
Parquet Vectorized (Pushdown)                                        1023           1029           4         15.4          65.1       6.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6164           6224          47          2.6         391.9       1.0X
Parquet Vectorized (Pushdown)                                        3332           3360          45          4.7         211.9       1.8X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                   6154           6192          38          2.6         391.3       1.0X
Parquet Vectorized (Pushdown)                                        5588           5679          92          2.8         355.3       1.1X
```
Closes #29642 from wangyum/SPARK-32792.
Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>