spark-instrumented-optimizer

src	[SPARK-24402][SQL] Optimize `In` expression when only one element in the collection or collection is empty	2018-07-16 15:33:39 -07:00
pom.xml	[SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module	2018-01-15 07:49:34 -06:00

DB Tsai 0f0d1865f5 [SPARK-24402][SQL] Optimize `In` expression when only one element in the collection or collection is empty

## What changes were proposed in this pull request?

Two new rules are added to the logical plan optimizer.

1. When the **`Collection`** contains only one element, the **`In`**
expression is optimized to **`EqualTo`**, so predicate pushdown can be
used.

```scala
    profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
    """
      |== Physical Plan ==
      |*(1) Project [profileID#0]
      |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
      |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
      |     PartitionFilters: [],
      |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
      |     ReadSchema: struct<profileID:int>
    """.stripMargin
```

2. When the **`Collection`** is empty and the input is nullable, the
logical plan is simplified to:

```scala
    profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
    """
      |== Optimized Logical Plan ==
      |Filter if (isnull(profileID#0)) null else false
      |+- Relation[profileID#0] parquet
    """.stripMargin
```
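The two rules can be sketched on a toy expression tree. This is only an illustration of the rewrite logic, not the code in this patch: `InRewriteSketch` and its case classes are hypothetical stand-ins for the Catalyst expressions, and the non-nullable empty-collection case is deliberately left out.

```scala
// Toy model only: these classes are hypothetical stand-ins, not Catalyst's.
object InRewriteSketch {
  sealed trait Expr
  final case class Column(name: String, nullable: Boolean) extends Expr
  final case class Literal(value: Any) extends Expr // Literal(null) models SQL NULL
  final case class In(child: Expr, values: Seq[Expr]) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class IsNull(child: Expr) extends Expr
  final case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  def optimize(e: Expr): Expr = e match {
    // Rule 2: empty collection on a nullable input.
    // NULL IN () stays NULL; any other value IN () is false.
    case In(c @ Column(_, true), Seq()) =>
      If(IsNull(c), Literal(null), Literal(false))
    // Rule 1: a single element turns membership into equality,
    // which a data source can accept as a pushed-down filter.
    case In(child, Seq(single)) =>
      EqualTo(child, single)
    // Everything else is left unchanged in this sketch.
    case other => other
  }
}
```

For example, `InRewriteSketch.optimize(In(Column("profileID", nullable = true), Seq(Literal(6))))` yields an `EqualTo` node, mirroring the `EqualTo(profileID,6)` entry in `PushedFilters` above.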

TODO:

1. When the collection has multiple elements but fewer than a certain
threshold, we should still allow predicate pushdown.
2. Optimize the **`In`** using **`tableswitch`** or **`lookupswitch`**
when the number of categories is low and they are **`Int`** or
**`Long`**.
3. The default immutable hash tree set is slow to query, and we should
benchmark different set implementations for faster lookups.
4. **`filter(if (condition) null else false)`** can be optimized to
false, because a filter drops rows whose predicate evaluates to null.
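TODO items 2 and 4 can be illustrated in plain Scala. This is a hedged sketch, not a proposed implementation: `InTodoSketch` and its members are hypothetical names, the integer values are arbitrary, and `Option[Boolean]` is used here only to model SQL's three-valued logic (`None` stands for null).

```scala
import scala.annotation.switch

// Hypothetical helper names, for illustration only.
object InTodoSketch {
  // Item 2: a match over Int literals. The @switch annotation asks the
  // compiler to flag the match if it cannot be emitted as a JVM
  // tableswitch or lookupswitch instruction.
  def inSmallIntSet(x: Int): Boolean = (x: @switch) match {
    case 1 | 2 | 3 | 5 | 8 => true
    case _                 => false
  }

  // Item 4: a filter keeps a row only when the predicate is definitely true.
  // None models SQL NULL.
  def keepRow(predicate: Option[Boolean]): Boolean = predicate.contains(true)

  // Models `if (condition) null else false`.
  def ifNullElseFalse(condition: Boolean): Option[Boolean] =
    if (condition) None else Some(false)
}
```

Since `keepRow(ifNullElseFalse(c))` is false for both values of `c`, the whole filter predicate behaves like the constant false, which is the simplification item 4 asks for.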

## How was this patch tested?

A couple of new tests are added.

Author: DB Tsai <d_tsai@apple.com>

Closes #21442 from dbtsai/optimize-in.

2018-07-16 15:33:39 -07:00
