f49bf1a072
### What changes were proposed in this pull request?

This PR adds support for lateral subqueries. A lateral subquery is a subquery preceded by the `LATERAL` keyword in the FROM clause of a query that can reference columns in the preceding FROM items. For example:

```sql
SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c)
```

A new subquery expression `LateralSubquery` is used to represent a lateral subquery. It is similar to `ScalarSubquery` but can return multiple rows and columns. A new logical unary node `LateralJoin` is used to represent a lateral join.

Here is the analyzed plan for the above query:

```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a], Inner
   :  +- Project [c, d]
   :     +- Filter (outer(a) = c)
   :        +- Relation [c, d]
   +- Relation [a, b]
```

Similar to a correlated subquery, a lateral subquery can be viewed as a dependent (nested loop) join where the evaluation of the right subtree depends on the current value of the left subtree. The same technique used to decorrelate a subquery is used to decorrelate a lateral join:

```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a && a = c], Inner  // pull up correlated predicates as join conditions
   :  +- Project [c, d]
   :     +- Relation [c, d]
   +- Relation [a, b]
```

Then the lateral join can be rewritten into a normal join:

```scala
Join Inner (a = c)
:- Relation [a, b]
+- Relation [c, d]
```

#### Follow-ups:

1. Similar to rewriting correlated scalar subqueries, rewriting lateral joins is also subject to the COUNT bug (see SPARK-15370 for more details). This is **not** handled in the current PR as it requires a sizeable amount of refactoring. It will be addressed in a subsequent PR (SPARK-35551).
2. Currently Spark does not use outer query references to resolve star expressions in subqueries. This is not specific to lateral subqueries and can be handled in a separate PR (SPARK-35618).

### Why are the changes needed?

To support an ANSI SQL feature.

### Does this PR introduce _any_ user-facing change?

Yes. It allows users to use lateral subqueries in the FROM clause of a query.

### How was this patch tested?

- Parser test: `PlanParserSuite.scala`
- Analyzer test: `ResolveSubquerySuite.scala`
- Optimizer test: `PullupCorrelatedPredicatesSuite.scala`
- SQL test: `join-lateral.sql`, `postgreSQL/join.sql`

Closes #32303 from allisonwang-db/spark-34382-lateral.

Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
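
The dependent (nested loop) reading of a lateral join described in this PR, and its equivalence to a plain inner join once the correlated predicate is pulled up, can be illustrated with ordinary Scala collections. This is only a sketch of the semantics, not the Catalyst implementation: the case classes `T1` and `T2`, the sample rows, and the helpers `lateral` and `innerJoin` are all hypothetical names introduced for illustration.

```scala
// Hypothetical in-memory stand-ins for the tables t1(a, b) and t2(c, d).
final case class T1(a: Int, b: String)
final case class T2(c: Int, d: String)

object LateralJoinSketch {
  // Made-up sample data, chosen so that one left row has no match.
  val t1: Seq[T1] = Seq(T1(1, "x"), T1(2, "y"), T1(3, "z"))
  val t2: Seq[T2] = Seq(T2(1, "p"), T2(1, "q"), T2(3, "r"))

  // Dependent (nested loop) form: the right side is a function of the
  // current left row and is re-evaluated once per left row.
  def lateral(left: Seq[T1], right: T1 => Seq[T2]): Seq[(T1, T2)] =
    left.flatMap(l => right(l).map(r => (l, r)))

  // Decorrelated form: the right side no longer sees the left row; the
  // former correlated predicate is applied as a join condition instead.
  def innerJoin(left: Seq[T1], right: Seq[T2])(cond: (T1, T2) => Boolean): Seq[(T1, T2)] =
    for (l <- left; r <- right if cond(l, r)) yield (l, r)

  // SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c)
  val viaLateral: Seq[(T1, T2)] = lateral(t1, l => t2.filter(_.c == l.a))

  // Join Inner (a = c)
  val viaJoin: Seq[(T1, T2)] = innerJoin(t1, t2)(_.a == _.c)

  def main(args: Array[String]): Unit = {
    // Both forms produce the same rows for this filter-only subquery.
    assert(viaLateral == viaJoin)
    viaLateral.foreach(println)
  }
}
```

The equivalence is shown only for a subquery that filters on the correlated column. As the follow-ups above note, the rewrite is subject to the COUNT bug, so it should not be assumed to carry over unchanged to subqueries that aggregate.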