spark-instrumented-optimizer/sql/core
Dongjoon Hyun 7f7eb3934e [SPARK-16360][SQL] Speed up SQL query performance by removing redundant executePlan call
## What changes were proposed in this pull request?

Currently, there are a few reports about Spark 2.0 query performance regression for large queries.

This PR speeds up SQL query processing performance by removing redundant **consecutive `executePlan`** call in `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not real query execution. So, we can not see the result in the Spark Web UI. Please use the following query script. The result is **25.78 sec** -> **12.36 sec** as expected.

**Sample Query**
```scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}
```

**Before**
```scala
scala> time(sql(query))
Elapsed time: 30.138142577s  // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 25.787751452s  // Let's compare this one.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```

**After**
```scala
scala> time(sql(query))
Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 12.364812255s  // This shows the real difference. The speed up is about 2 times.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```

## How was this patch tested?

Manual by the above script.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14044 from dongjoon-hyun/SPARK-16360.
2016-07-05 16:19:22 +08:00
..
benchmarks [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark 2016-06-11 15:26:08 -07:00
src [SPARK-16360][SQL] Speed up SQL query performance by removing redundant executePlan call 2016-07-05 16:19:22 +08:00
pom.xml [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV 2016-05-25 12:40:16 -07:00