spark-instrumented-optimizer/sql
Max Gekk adba2ec8f2 [SPARK-34099][SQL] Keep dependents of v2 tables cached during cache refreshing in DS v2 commands
### What changes were proposed in this pull request?

This PR changes cache refreshing of v2 tables in v2 commands. In particular, v2 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v2 table and its dependents. In more details:

1. Add new method `recacheTable()` to `DataSourceV2Strategy` and pass it the exec node where need to recache table. New method uses `recacheByPlan` to refresh data cache of v2 tables, and keeps table dependents still cached **while clearing their caches**.
2. Simplify `invalidateCache` (and rename it `invalidateTableCache`) by retargeting it for only table cache invalidation.
3. Modify a test for `REFRESH TABLE` and check that v2 table dependent is still cached after refreshing the base table.

### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v2 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v2 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly.
2. Improve code maintenance.
3. Reduce the number of calls to the Cache Manager when need to recache a table. Before the changes, `invalidateCache()` invokes the Cache Manager 3 times: `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`.
4. Also this should speed up table recaching.

### Does this PR introduce _any_ user-facing change?
From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time.

### How was this patch tested?
By running the existing test suites for v2 the add/drop/rename partition commands.

Closes #31172 from MaxGekk/dsv2-recache-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-15 03:32:49 +00:00
..
catalyst [SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql 2021-01-14 15:27:14 +00:00
core [SPARK-34099][SQL] Keep dependents of v2 tables cached during cache refreshing in DS v2 commands 2021-01-15 03:32:49 +00:00
hive [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion 2021-01-13 18:07:02 -06:00
hive-thriftserver [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion 2021-01-13 18:07:02 -06:00
create-docs.sh [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL documentation build 2021-01-05 19:48:10 +09:00
gen-sql-api-docs.py [SPARK-34022][DOCS][FOLLOW-UP] Fix typo in SQL built-in function docs 2021-01-06 09:28:22 -08:00
gen-sql-config-docs.py [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc 2020-04-27 17:08:52 +09:00
gen-sql-functions-docs.py [SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp 2020-04-26 11:46:52 -07:00
mkdocs.yml [SPARK-30731] Update deprecated Mkdocs option 2020-02-19 17:28:58 +09:00
README.md [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options 2020-02-09 19:20:47 +09:00

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.