Anton Okolnychyi bc9f9b4d6e [SPARK-25860][SQL] Replace Literal(null, _) with FalseLiteral whenever possible

## What changes were proposed in this pull request?

This PR proposes a new optimization rule that replaces `Literal(null, _)` with `FalseLiteral` in conditions in `Join` and `Filter`, predicates in `If`, conditions in `CaseWhen`.

The idea is that some expressions evaluate to `false` if the underlying expression is `null` (as an example see `GeneratePredicate$create` or `doGenCode` and `eval` methods in `If` and `CaseWhen`). Therefore, we can replace `Literal(null, _)` with `FalseLiteral`, which can lead to more optimizations later on.

Let’s consider a few examples.

```
val df = spark.range(1, 100).select($"id".as("l"), ($"id" > 50).as("b"))
df.createOrReplaceTempView("t")
df.createOrReplaceTempView("p")
```

Case 1

```
spark.sql("SELECT * FROM t WHERE if(l > 10, false, NULL)").explain(true)

// without the new rule
…
== Optimized Logical Plan ==
Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
+- Filter if ((id#0L > 10)) false else null
   +- Range (1, 100, step=1, splits=Some(12))

== Physical Plan ==
(1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
+- (1) Filter if ((id#0L > 10)) false else null
   +- (1) Range (1, 100, step=1, splits=12)

// with the new rule
…
== Optimized Logical Plan ==
LocalRelation <empty>, [l#2L, s#3]

== Physical Plan ==
LocalTableScan <empty>, [l#2L, s#3]
```

Case 2

```
spark.sql("SELECT * FROM t WHERE CASE WHEN l < 10 THEN null WHEN l > 40 THEN false ELSE null END").explain(true)

// without the new rule
...
== Optimized Logical Plan ==
Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
+- Filter CASE WHEN (id#0L < 10) THEN null WHEN (id#0L > 40) THEN false ELSE null END
   +- Range (1, 100, step=1, splits=Some(12))

== Physical Plan ==
(1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
+- (1) Filter CASE WHEN (id#0L < 10) THEN null WHEN (id#0L > 40) THEN false ELSE null END
   +- (1) Range (1, 100, step=1, splits=12)

// with the new rule
...
== Optimized Logical Plan ==
LocalRelation <empty>, [l#2L, s#3]

== Physical Plan ==
LocalTableScan <empty>, [l#2L, s#3]
```

Case 3

```
spark.sql("SELECT * FROM t JOIN p ON IF(t.l > p.l, null, false)").explain(true)

// without the new rule
...
== Optimized Logical Plan ==
Join Inner, if ((l#2L > l#37L)) null else false
:- Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
:  +- Range (1, 100, step=1, splits=Some(12))
+- Project [id#0L AS l#37L, cast(id#0L as string) AS s#38]
   +- Range (1, 100, step=1, splits=Some(12))

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, if ((l#2L > l#37L)) null else false
:- (1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3]
:  +- (1) Range (1, 100, step=1, splits=12)
+- BroadcastExchange IdentityBroadcastMode
   +- (2) Project [id#0L AS l#37L, cast(id#0L as string) AS s#38]
      +- (2) Range (1, 100, step=1, splits=12)

// with the new rule
...
== Optimized Logical Plan ==
LocalRelation <empty>, [l#2L, s#3, l#37L, s#38]
```

## How was this patch tested?

This PR comes with a set of dedicated tests.

Closes #22857 from aokolnychyi/spark-25860.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

2018-10-31 18:35:33 +00:00
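The reasoning in the commit message above (a `null` in predicate position behaves like `false`, so it can be replaced) can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's implementation: the node classes `Lit`, `Col`, `Cmp`, `Not`, `If`, `CaseWhen` and the functions `evaluate`, `passes`, and `replace_null_with_false` are hypothetical stand-ins invented for illustration, and the model assumes every `If`/`CaseWhen` it sees is boolean-valued.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Hypothetical toy nodes for illustration only; these are NOT Catalyst classes.
@dataclass(frozen=True)
class Lit:
    value: Any  # None models SQL NULL

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Cmp:
    op: str  # ">" or "<"
    left: Any
    right: Any

@dataclass(frozen=True)
class Not:
    child: Any

@dataclass(frozen=True)
class If:
    pred: Any
    then: Any
    otherwise: Any

@dataclass(frozen=True)
class CaseWhen:
    branches: Tuple[Tuple[Any, Any], ...]
    otherwise: Any

def evaluate(e, row):
    """Evaluate with SQL-style three-valued logic (None stands for NULL)."""
    if isinstance(e, Lit):
        return e.value
    if isinstance(e, Col):
        return row[e.name]
    if isinstance(e, Cmp):
        l, r = evaluate(e.left, row), evaluate(e.right, row)
        if l is None or r is None:
            return None
        return l > r if e.op == ">" else l < r
    if isinstance(e, Not):
        v = evaluate(e.child, row)
        return None if v is None else not v
    if isinstance(e, If):
        # A NULL predicate selects the else branch, exactly like a false one.
        chosen = e.then if evaluate(e.pred, row) is True else e.otherwise
        return evaluate(chosen, row)
    if isinstance(e, CaseWhen):
        for cond, value in e.branches:
            if evaluate(cond, row) is True:
                return evaluate(value, row)
        return evaluate(e.otherwise, row)
    raise TypeError(f"unsupported node: {e!r}")

def passes(pred, row):
    """A filter or join condition keeps a row only when it evaluates to true."""
    return evaluate(pred, row) is True

def replace_null_with_false(e):
    """Replace NULL literals with false where NULL and false are equivalent."""
    if isinstance(e, Lit) and e.value is None:
        return Lit(False)
    if isinstance(e, If):
        return If(replace_null_with_false(e.pred),
                  replace_null_with_false(e.then),
                  replace_null_with_false(e.otherwise))
    if isinstance(e, CaseWhen):
        branches = tuple((replace_null_with_false(c), replace_null_with_false(v))
                         for c, v in e.branches)
        return CaseWhen(branches, replace_null_with_false(e.otherwise))
    return e  # anything else (including Not) is left untouched

# Predicates modeled on Case 1 and Case 2 of the commit message.
case1 = If(Cmp(">", Col("l"), Lit(10)), Lit(False), Lit(None))
case2 = CaseWhen(((Cmp("<", Col("l"), Lit(10)), Lit(None)),
                  (Cmp(">", Col("l"), Lit(40)), Lit(False))), Lit(None))
rows = [{"l": i} for i in range(1, 100)]

for pred in (case1, case2):
    rewritten = replace_null_with_false(pred)
    # The set of rows that pass the filter is unchanged by the rewrite.
    assert all(passes(pred, r) == passes(rewritten, r) for r in rows)
```

The sketch deliberately does not descend into `Not`: `NOT NULL` is `NULL` (the row is dropped), while `NOT false` is `true` (the row is kept), so the replacement would change the result there. After the rewrite, Case 1 becomes an `If` whose branches are both `false`, which is the kind of expression later simplification rules can fold away.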
catalyst	[SPARK-25860][SQL] Replace Literal(null, _) with FalseLiteral whenever possible	2018-10-31 18:35:33 +00:00
core	[SPARK-25860][SQL] Replace Literal(null, _) with FalseLiteral whenever possible	2018-10-31 18:35:33 +00:00
hive	[SPARK-25893][SQL] Show a directional error message for unsupported Hive Metastore versions	2018-10-31 09:20:19 -07:00
hive-thriftserver	[SPARK-25735][CORE][MINOR] Improve start-thriftserver.sh: print clean usage and exit with code 1	2018-10-17 09:56:17 -05:00
create-docs.sh	[MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script	2017-08-26 13:56:24 +09:00
gen-sql-markdown.py	[SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments separately, and note/since in SQL built-in function documentation	2017-08-05 10:10:56 -07:00
mkdocs.yml	[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions	2017-07-26 09:38:51 -07:00
README.md	[MINOR][DOC] Fix some typos and grammar issues	2018-04-06 13:37:08 +08:00

README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2-compatible server (for JDBC/ODBC).

Running sql/create-docs.sh generates SQL documentation for built-in functions under sql/site.