hyukjinkwon d314677cfd [SPARK-16461][SQL] Support partition batch pruning with <=> predicate in InMemoryTableScanExec
## What changes were proposed in this pull request?

It seems the `EqualNullSafe` filter was missed when batch-pruning partitions in cached tables.

Supporting this filter appears to make the query roughly 5 times faster.
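
The gain comes from skipping whole cached batches using their column statistics. As a rough standalone sketch of that idea (not the actual `InMemoryTableScanExec` code; `BatchStats` and `mightMatch` are hypothetical names), a batch whose min/max range cannot contain the literal never needs to be scanned:

```scala
// Hypothetical model of the per-batch statistics kept for a cached column;
// the real stats live in the cached relation's batch metadata.
case class BatchStats(lowerBound: Long, upperBound: Long, nullCount: Int)

// `col <=> lit` can only hold in a batch whose [lowerBound, upperBound]
// range contains the literal. Unlike plain `=`, `<=>` also matches null
// against null, so a null literal can only match batches containing nulls.
def mightMatch(stats: BatchStats, lit: Option[Long]): Boolean = lit match {
  case Some(v) => stats.lowerBound <= v && v <= stats.upperBound
  case None    => stats.nullCount > 0
}

// Example: a batch covering ids 0..99 with no nulls can be skipped for
// `id <=> 200` but must still be scanned for `id <=> 1`.
assert(!mightMatch(BatchStats(0, 99, 0), Some(200)))
assert(mightMatch(BatchStats(0, 99, 0), Some(1)))
```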

Running the code below:

```scala
import org.apache.spark.util.Benchmark

// Runs inside a Spark SQL test suite, which provides `spark` and `sql`.
test("Null-safe equal comparison") {
  val N = 20000000
  val df = spark.range(N).repartition(20)
  val benchmark = new Benchmark("Null-safe equal comparison", N)
  df.createOrReplaceTempView("t")
  spark.catalog.cacheTable("t")
  // Materialize the cached table before measuring.
  sql("select id from t where id <=> 1").collect()

  benchmark.addCase("Null-safe equal comparison", 10) { _ =>
    sql("select id from t where id <=> 1").collect()
  }
  benchmark.run()
}
```

produces the results below:

**Before:**

```
Running benchmark: Null-safe equal comparison
  Running case: Null-safe equal comparison
  Stopped after 10 iterations, 2098 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

Null-safe equal comparison:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Null-safe equal comparison                     204 /  210         98.1          10.2       1.0X
```

**After:**

```
Running benchmark: Null-safe equal comparison
  Running case: Null-safe equal comparison
  Stopped after 10 iterations, 478 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

Null-safe equal comparison:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Null-safe equal comparison                      42 /   48        474.1           2.1       1.0X
```

## How was this patch tested?

Unit tests in `PartitionBatchPruningSuite`.
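
A minimal end-to-end check in the same spirit (simplified; the actual suite also verifies how many partitions and batches are read after pruning, and the table name here is made up):

```scala
// Hypothetical smoke test: cache a table, filter with <=>, and check
// the result; assumes a running SparkSession named `spark`.
val df = spark.range(10).repartition(2)
df.createOrReplaceTempView("pruningData")
spark.catalog.cacheTable("pruningData")

val result = spark.sql("SELECT id FROM pruningData WHERE id <=> 5").collect()
assert(result.map(_.getLong(0)).toSeq == Seq(5L))
```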

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14117 from HyukjinKwon/SPARK-16461.
2016-09-01 15:32:07 -07:00
| Path | Latest commit | Date |
|------|---------------|------|
| catalyst | [SPARK-16732][SQL] Remove unused codes in subexpressionEliminationForWholeStageCodegen | 2016-09-01 14:13:38 -07:00 |
| core | [SPARK-16461][SQL] Support partition batch pruning with <=> predicate in InMemoryTableScanExec | 2016-09-01 15:32:07 -07:00 |
| hive | [SPARK-16926] [SQL] Remove partition columns from partition metadata. | 2016-09-01 14:13:17 -07:00 |
| hive-thriftserver | [SPARK-17190][SQL] Removal of HiveSharedState | 2016-08-25 12:50:03 +08:00 |
| README.md | [SPARK-16557][SQL] Remove stale doc in sql/README.md | 2016-07-14 19:24:42 -07:00 |

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
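
For example, the same filter can be expressed either as SQL or as a DataFrame expression (a minimal sketch; the app name and master below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(5).toDF("id")
df.createOrReplaceTempView("nums")

// The same filter, written in SQL...
spark.sql("SELECT id FROM nums WHERE id <=> 1").show()
// ...and with the DataFrame/Dataset API.
df.filter($"id" <=> 1).show()
```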

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
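
As a sketch, the Hive-backed features above are enabled through the session builder in Spark 2.x, where `SparkSession` subsumes the `HiveContext` described in the list (the app name below is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() activates the HiveQL features, Hive SerDes, and
// Hive Metastore connectivity provided by sql/hive.
val spark = SparkSession.builder()
  .appName("hive-example")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW TABLES").show()
```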