spark-instrumented-optimizer/sql
Cheng Su 3a361cd837 [SPARK-34253][SQL] Object hash aggregate should not fallback if no more input rows
### What changes were proposed in this pull request?

Object hash aggregate will fallback to sort-based aggregation based on number of keys seen so far [0]. The default config threshold is 128 (spark.sql.objectHashAggregate.sortBased.fallbackThreshold in [1]). There's an edge case we can do better, where we do not fallback if there's no more input rows. Suppose the task only has 128 group-by keys in hash ma, we don't need to fallback in this case, and we can save the extra sort. This is an rare edge case in production, but it can happen.

[0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L161

[1]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1615

### Why are the changes needed?

To avoid unnecessary sort in query. Save resource.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a unit test to verify task fallback or not is challenging. Given the change is pretty minor, besides relying on existing test in `ObjectHashAggregateSuite.scala`, I manually ran the followed query, and verified in debug mode that the code path for fallback was not executed. And verified the code path for fallback was executed without this change.

```
withSQLConf(
  SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
  SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1") {
  Seq.fill(1)(Tuple1(Array.empty[Int]))
    .toDF("c0")
    .groupBy(lit(1))
    .agg(typed_count($"c0"), max($"c0")).collect()
}
```

Closes #31353 from c21/object-hash-fallback.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-28 15:18:54 +00:00
..
catalyst [SPARK-33542][SQL][FOLLOWUP] Group exception messages in catalyst/catalog 2021-01-28 05:15:57 +00:00
core [SPARK-34253][SQL] Object hash aggregate should not fallback if no more input rows 2021-01-28 15:18:54 +00:00
hive [SPARK-34262][SQL] Refresh cached data of v1 table in ALTER TABLE .. SET LOCATION 2021-01-28 15:05:22 +09:00
hive-thriftserver [SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT" 2021-01-20 02:09:39 +00:00
create-docs.sh [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL documentation build 2021-01-05 19:48:10 +09:00
gen-sql-api-docs.py [SPARK-34022][DOCS][FOLLOW-UP] Fix typo in SQL built-in function docs 2021-01-06 09:28:22 -08:00
gen-sql-config-docs.py [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc 2020-04-27 17:08:52 +09:00
gen-sql-functions-docs.py [SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp 2020-04-26 11:46:52 -07:00
mkdocs.yml [SPARK-30731] Update deprecated Mkdocs option 2020-02-19 17:28:58 +09:00
README.md [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options 2020-02-09 19:20:47 +09:00

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.