spark-instrumented-optimizer/sql/core
WeichenXu 5631a96367 [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection
### What changes were proposed in this pull request?
The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run.
In this PR, in `isInCollection()` function, directly generate `InSet` expression, avoid generating too many children expressions.

### Why are the changes needed?
`Column.isInCollection()` with a large size collection sometimes become a bottleneck when running sql.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually benchmark it in spark-shell:
```
def testExplainTime(collectionSize: Int) = {
        val df = spark.range(10).withColumn("id2", col("id") + 1)
        val list = Range(0, collectionSize).toList
        val startTime = System.currentTimeMillis()
        df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
        val elapsedTime = System.currentTimeMillis() - startTime
        println(s"cost time: ${elapsedTime}ms")
}
```
Then test on collection size 5, 10, 100, 1000, 10000, test result is:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms

Closes #25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 17:23:08 -07:00
..
benchmarks [SPARK-29065][SQL][TEST] Extend EXTRACT benchmark 2019-09-12 21:32:35 +09:00
src [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection 2019-09-12 17:23:08 -07:00
v1.2.1/src [SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession 2019-08-19 19:01:56 +08:00
v2.3.5/src [SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession 2019-08-19 19:01:56 +08:00
pom.xml [SPARK-27521][SQL] Move data source v2 to catalyst module 2019-06-05 09:55:55 -07:00