History

Takeshi Yamamuro b806fc4582 [SPARK-31854][SQL] Invoke in MapElementsExec should not propagate null

### What changes were proposed in this pull request?

This PR fixes the following bug in `Dataset.map`, which occurs when whole-stage codegen is enabled:

```
scala> val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS()
scala> sql("SET spark.sql.codegen.wholeStage=true")
scala> ds.map(v=>(v,v)).explain
== Physical Plan ==
*(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1.intValue AS _1#69, assertnotnull(input[0, scala.Tuple2, true])._2.intValue AS _2#70]
+- *(1) MapElements <function1>, obj#68: scala.Tuple2
   +- *(1) DeserializeToObject staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, value#1, true, false), obj#67: java.lang.Integer
      +- LocalTableScan [value#1]

// `AssertNotNull` in `SerializeFromObject` will fail;
scala> ds.map(v => (v, v)).show()
java.lang.NullPointerException: Null value appeared in non-nullable field:
top level Product input object
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

// When the whole-stage codegen disabled, the query works well;
scala> sql("SET spark.sql.codegen.wholeStage=false")
scala> ds.map(v=>(v,v)).show()
+----+----+
|  _1|  _2|
+----+----+
|   1|   1|
|null|null|
+----+----+
```

The root cause is that the `Invoke` used in `MapElementsExec` propagates a null input, and then `AssertNotNull` (`1b780f364b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala`, L253-L255) in `SerializeFromObject` fails because the top-level row becomes null. `MapElementsExec` should therefore return `(null, null)` rather than `null`.

NOTE: the generated code for the query above on the current master:

```
private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
  boolean mapelements_isNull_1 = true;
  scala.Tuple2 mapelements_value_1 = null;
  if (!false) {
    mapelements_resultIsNull_0 = false;

    if (!mapelements_resultIsNull_0) {
      mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
      mapelements_mutableStateArray_0[0] = mapelements_expr_0_0;
    }

    mapelements_isNull_1 = mapelements_resultIsNull_0;
    if (!mapelements_isNull_1) {
      Object mapelements_funcResult_0 = null;
      mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]);

      if (mapelements_funcResult_0 != null) {
        mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0;
      } else {
        mapelements_isNull_1 = true;
      }

    }
  }

  serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);

}
```

The generated code with this fix:

```
private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
  boolean mapelements_isNull_1 = true;
  scala.Tuple2 mapelements_value_1 = null;
  if (!false) {
    mapelements_mutableStateArray_0[0] = mapelements_expr_0_0;

    mapelements_isNull_1 = false;
    if (!mapelements_isNull_1) {
      Object mapelements_funcResult_0 = null;
      mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]);

      if (mapelements_funcResult_0 != null) {
        mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0;
        mapelements_isNull_1 = false;
      } else {
        mapelements_isNull_1 = true;
      }

    }
  }

  serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);

}
```

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28681 from maropu/SPARK-31854.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

2020-06-01 04:50:00 +00:00
catalyst	[SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic	2020-06-01 09:45:21 +09:00
core	[SPARK-31854][SQL] Invoke in MapElementsExec should not propagate null	2020-06-01 04:50:00 +00:00
hive	[SPARK-31862][SQL] Remove exception wrapping in AQE	2020-05-29 04:23:38 +00:00
hive-thriftserver	[SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues	2020-05-30 06:14:32 +00:00
create-docs.sh	[SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc	2020-04-27 17:08:52 +09:00
gen-sql-api-docs.py	[SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info	2020-04-23 13:33:04 +09:00
gen-sql-config-docs.py	[SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc	2020-04-27 17:08:52 +09:00
gen-sql-functions-docs.py	[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp	2020-04-26 11:46:52 -07:00
mkdocs.yml	[SPARK-30731] Update deprecated Mkdocs option	2020-02-19 17:28:58 +09:00
README.md	[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options	2020-02-09 19:20:47 +09:00

README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
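
As an illustration of the "SQL or the DataFrame/Dataset API" distinction described above, here is a minimal Scala sketch that runs the same relational query both ways. It assumes Spark 2.0 or later on the classpath and uses `SparkSession` as the entry point rather than the `SQLContext` named in the list; the object name, view name, sample rows, and the `local[*]` master are hypothetical choices made only for this example.

```scala
import org.apache.spark.sql.SparkSession

object SqlVsDataFrame {
  // Runs one query as SQL text and once through the DataFrame API,
  // returning the names produced by each form.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    // Hypothetical sample data, registered as a temporary view for SQL access.
    val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The query expressed as SQL text.
    val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")
    // The same query expressed with DataFrame operators.
    val viaApi = people.filter($"age" > 30).select("name")

    (viaSql.as[String].collect().toSeq, viaApi.as[String].collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-vs-dataframe")
      .getOrCreate()
    try {
      val (sqlNames, apiNames) = run(spark)
      println(s"SQL: $sqlNames, DataFrame API: $apiNames")
    } finally {
      spark.stop()
    }
  }
}
```

Both forms are parsed into Catalyst logical plans and executed by the engine in sql/core, so the choice between them is mostly one of style.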

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.