spark-instrumented-optimizer/sql
Bruce Robbins 497d17f38a [SPARK-39496][SQL] Handle null struct in `Inline.eval`
Change `Inline.eval` to return a row of null values rather than a null row in the case of a null input struct.

Consider the following query:
```
set spark.sql.codegen.wholeStage=false;
select inline(array(named_struct('a', 1, 'b', 2), null));
```
This query fails with a `NullPointerException`:
```
22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
```
(In Spark 3.1.3, you don't need to set `spark.sql.codegen.wholeStage` to false to reproduce the error, since Spark 3.1.3 has no codegen path for `Inline`).

This query fails regardless of the setting of `spark.sql.codegen.wholeStage`:
```
val dfWide = (Seq((1))
  .toDF("col0")
  .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))

val df = (dfWide
  .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as struct_array"))

df.selectExpr("*", "inline(struct_array)").collect
```
It fails with
```
22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown Source)
```
When `Inline.eval` returns a null row in the collection, GenerateExec gets a NullPointerException either when joining the null row with required child output, or projecting the null row.

This PR avoids producing the null row and produces a row of null values instead:
```
spark-sql> set spark.sql.codegen.wholeStage=false;
spark.sql.codegen.wholeStage	false
Time taken: 3.095 seconds, Fetched 1 row(s)
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1	2
NULL	NULL
Time taken: 1.214 seconds, Fetched 2 row(s)
spark-sql>
```

No.

New unit test.

Closes #36903 from bersprockets/inline_eval_null_struct_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c4d5390dd0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2022-06-18 09:27:07 +09:00
..
catalyst [SPARK-39496][SQL] Handle null struct in `Inline.eval` 2022-06-18 09:27:07 +09:00
core [SPARK-39496][SQL] Handle null struct in `Inline.eval` 2022-06-18 09:27:07 +09:00
hive [SPARK-38786][SQL][TEST] Bug in StatisticsSuite 'change stats after add/drop partition command' 2022-05-08 17:33:57 -07:00
hive-thriftserver Revert "[SPARK-37090][BUILD][3.2] Upgrade libthrift to 0.16.0 to avoid security vulnerabilities" 2022-03-01 22:21:10 -08:00
README.md
create-docs.sh [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL documentation build 2021-01-05 19:48:10 +09:00
gen-sql-api-docs.py [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document 2021-03-19 10:19:26 +09:00
gen-sql-config-docs.py [SPARK-36657][SQL] Update comment in 'gen-sql-config-docs.py' 2021-09-02 18:51:10 -07:00
gen-sql-functions-docs.py
mkdocs.yml

README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.