Sean Owen b69c26833c [SPARK-35848][MLLIB] Optimize some treeAggregates in MLlib by delaying allocations
### What changes were proposed in this pull request?

Optimize some treeAggregates in MLlib by delaying the allocation of large arrays of zeroes (and thus avoiding sending them around).
This uses the same idea as in https://github.com/apache/spark/pull/23600/files
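For illustration, here is a minimal sketch of the pattern (not the PR's actual code; `sumFeatures` and the dense-array accumulator are hypothetical). Passing `null` as the zero value means no large array is serialized along with the aggregate, and empty partitions contribute nothing; each partition allocates its buffer only when it sees its first record:

```scala
import org.apache.spark.rdd.RDD

// Sketch: sum per-feature values without shipping a zero array around.
def sumFeatures(data: RDD[Array[Double]], numFeatures: Int): Array[Double] = {
  val agg = data.treeAggregate[Array[Double]](null)(
    seqOp = (acc, row) => {
      // Allocate the buffer lazily, on the first record in this partition.
      val buf = if (acc == null) new Array[Double](numFeatures) else acc
      var i = 0
      while (i < numFeatures) { buf(i) += row(i); i += 1 }
      buf
    },
    combOp = (a, b) => {
      // A null side means that partition (or subtree) saw no data.
      if (a == null) b
      else if (b == null) a
      else {
        var i = 0
        while (i < numFeatures) { a(i) += b(i); i += 1 }
        a
      }
    }
  )
  // A fully empty RDD yields null; fall back to an explicit zero here only.
  if (agg == null) new Array[Double](numFeatures) else agg
}
```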

### Why are the changes needed?

Allocating huge arrays of zeroes consumes additional memory and network I/O, which is unnecessary in some cases, and can cause operations to run out of memory that might otherwise succeed. Specifically, this change avoids the 'zero' value being (pointlessly) checked for serializability, a check that can fail when it passes through the default JavaSerializer; it also avoids allocating and sending a large 'zero' value for every empty partition in the aggregate.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33443 from srowen/SPARK-35848.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

# Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

- Catalyst (`sql/catalyst`) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (`sql/core`) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, `SQLContext`, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files (see the sketch after this list).
- Hive Support (`sql/hive`) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (`sql/hive-thriftserver`) - Includes support for the SQL CLI (`bin/spark-sql`) and a HiveServer2 (for JDBC/ODBC) compatible server.
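As a minimal sketch of issuing SQL through the public API (the file path, view name, and columns below are illustrative; `SparkSession` is used here as the modern entry point that subsumes `SQLContext`):

```scala
import org.apache.spark.sql.SparkSession

object SqlReadmeExample {
  def main(args: Array[String]): Unit = {
    // SparkSession wraps the SQLContext mentioned above.
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()

    // Register a Parquet file as a temporary view so SQL can reference it.
    // The path and schema here are hypothetical.
    spark.read.parquet("/tmp/people.parquet")
      .createOrReplaceTempView("people")

    // The SQL text is parsed and optimized by Catalyst, then executed
    // as Spark RDDs by the sql/core execution engine.
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```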

Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`, and SQL configuration documentation that gets included as part of `configuration.md` in the main `docs` directory.