|
|
|
@ -5,12 +5,13 @@
|
|
|
|
|
We begin by formalizing pattern match semantics, as typically used in the implementation of compilers to realize the pattern part of production rules.
|
|
|
|
|
Match patterns in most functional languages (e.g., Scala, OCaml) can be expressed through these semantics, while compilers implemented in imperative languages (e.g., Orca) typically invent analogous constructs.
|
|
|
|
|
|
|
|
|
|
For simplicity of presentation, we assume that constants are drawn from a domain $\constantdom$ that includes primitive values ($\prim$) and domain of abstract syntax tree nodes $\nodelabel(\ldots)$:
|
|
|
|
|
For simplicity of presentation, we assume that constants are drawn from a domain $\constantdom$ that includes primitive values ($\prim$) and abstract syntax tree nodes $\nodelabel(\ldots)$:
|
|
|
|
|
$$\constantdom : \nodelabel(\constantdom, \ldots, \constantdom) \oroption \prim$$
|
|
|
|
|
We assume that the primitive value domain includes at least boolean primitives ($\prim$) true ($\top$) and false ($\bot$).
|
|
|
|
|
We refer to $\nodelabel$ as the AST node type.
|
|
|
|
|
We assume that the primitive value domain includes at least boolean values true ($\top$) and false ($\bot$).
|
|
|
|
|
We refer to $\nodelabel$ as the AST node type, or label.
|
|
|
|
|
For simplicity, we abstract primitive-valued expressions $\expression \in \expressiondom$ as functions ($\expressiondom : \scopedom \rightarrow \constantdom$) that map from a scope to a constant.
|
|
|
|
|
We model scopes $\scope in \scopedom$ as maps of variable bindings ($\scopedom: \vardom \rightarrow \constantdom$), where $\vardom$ is the set of all variable names.
|
|
|
|
|
We write $\varsOf{\expression}$ to mean all scope variables referenced by $\expression$.
|
|
|
|
|
We model scopes $\scope \in \scopedom$ as maps of variable bindings ($\scopedom: \vardom \rightarrow \constantdom$), where $\vardom$ is the set of all variable names.
|
|
|
|
|
Unbound variables in a scope return an undefined, \texttt{null} value.
|
|
|
|
|
|
|
|
|
|
\begin{figure}
|
|
|
|
@ -33,12 +34,14 @@ Unbound variables in a scope return a undefined, \texttt{null} value.
|
|
|
|
|
|
|
|
|
|
\begin{example}
|
|
|
|
|
\Cref{fig:exampleAST} illustrates a simple abstract syntax tree with three node types (i.e., $\nodelabel \in \inset{\textbf{Filter}, \textbf{Project}, \textbf{Table}}$).
|
|
|
|
|
The first child of each node type is a primitive valued constant\footnote{Note that we treat collection types like lists as primitive types}, while the second child of the Filter and Project nodes are both ASTs.
|
|
|
|
|
The first child of each node type is a primitive-valued constant\footnote{Note that we treat collection types like lists as primitive types.}, while the second child of both the Filter and Project nodes is an AST node.
|
|
|
|
|
\end{example}
|
|
|
|
|
|
|
|
|
|
We define a language of match patterns as follows.
|
|
|
|
|
The core of the language consists of a pattern that matches an AST node and checks its children against sub-patterns, and a pattern that matches anything.
|
|
|
|
|
The pattern matching language also includes basic boolean operators, as well as a set of rules that manipulate or use the scope: matching based on the result of expression evaluation, matching an element of the scope against another pattern, and binding the result of expression evaluation or a sub-pattern into the scope.
|
|
|
|
|
An AST node pattern $\ell(\matcher, \ldots, \matcher)$ applies a set of patterns to its children.
|
|
|
|
|
The wildcard pattern $\matchany$ matches anything.
|
|
|
|
|
The assignment operator $\var \leftarrow \cdot$ assigns variables.
|
|
|
|
|
An arbitrary boolean-valued expression may be evaluated against bound variables as a pattern, and the operator $\passToMatcher$ applies a pattern to a bound variable. The remaining patterns are simple boolean operations.
|
|
|
|
|
\begin{align*}
|
|
|
|
|
\matcherdom :=& \nodelabel(\matcherdom, \ldots, \matcherdom)
|
|
|
|
|
\oroption \matchany
|
|
|
|
@ -46,17 +49,17 @@ The pattern matching language also includes basic boolean operators, as well as
|
|
|
|
|
\oroption \matcherdom \wedge \matcherdom
|
|
|
|
|
\oroption \matcherdom \vee \matcherdom\\
|
|
|
|
|
& \oroption \expressiondom
|
|
|
|
|
\oroption \vardom @ \matcherdom
|
|
|
|
|
\oroption \vardom \passToMatcher \matcherdom
|
|
|
|
|
\oroption \vardom \gets \expressiondom
|
|
|
|
|
\oroption \vardom \gets \matcherdom
|
|
|
|
|
\end{align*}
|
|
|
|
|
|
|
|
|
|
Where it is clear to do so, for a variable $\var \in \vardom$, we overload $\var$ to mean the matcher $\var \gets \matchany$.
|
|
|
|
|
Where it is clear to do so, we use variables $\var \in \vardom$ to mean a variable binding match pattern (i.e., the pattern $\var \gets \matchany$).
|
|
|
|
|
|
|
|
|
|
\begin{example}
|
|
|
|
|
\label{ex:pattern}
|
|
|
|
|
With $\inset{\texttt{cond}, \texttt{tgt}, \texttt{child}} \subset \vardom$, the match pattern for the select pushdown optimization from the introduction is:
|
|
|
|
|
$\textbf{Filter}( \texttt{cond} , \textbf{Project}( \texttt{tgt}, \texttt{child} ))$. Note the similarity to the match pattern in the introduction.
|
|
|
|
|
With $\texttt{cond}, \texttt{tgt}, \texttt{child} \in \vardom$, the match pattern for the select pushdown optimization from the introduction is:
|
|
|
|
|
$\textbf{Filter}( \texttt{cond} , \textbf{Project}( \texttt{tgt}, \texttt{child} ))$. Note the similarity to the Scala match pattern syntax in the introduction.
|
|
|
|
|
\end{example}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@ -74,10 +77,10 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi
|
|
|
|
|
|
|
|
|
|
\begin{example}
|
|
|
|
|
The pattern from our running example can be equivalently stated as:
|
|
|
|
|
$$\textbf{Filter}( \texttt{cond}, \texttt{p} ) \wedge \texttt{p} \passToMatcher \textbf{Project}( \texttt{tgt}, \texttt{child} )$$
|
|
|
|
|
$$\textbf{Filter}( \texttt{cond}, \texttt{p} ) \wedge \left( \texttt{p} \passToMatcher \textbf{Project}( \texttt{tgt}, \texttt{child} ) \right)$$
|
|
|
|
|
If the \textbf{Filter} node is not matched, the conjunction short-circuits.
|
|
|
|
|
If it is matched, the right half of the conjunction is evaluated with \texttt{cond} and \texttt{p} bound (in the scope) to the \textbf{Filter} node's children.
|
|
|
|
|
The $@$ operator evaluates the \textbf{Project} matcher on the constant bound to the variable \texttt{p} (i.e., $\scope(\texttt{p})$), and the rest proceeds as before.
|
|
|
|
|
The $\passToMatcher$ operator evaluates the \textbf{Project} matcher on the constant bound to the variable \texttt{p} (i.e., $\scope(\texttt{p})$), and the rest proceeds as before.
|
|
|
|
|
\end{example}
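
Instantiating the $\passToMatcher$ rule of \Cref{fig:evalMatcherSemantics} on this pattern makes the step explicit. This is only a sketch: $\constant$ stands for whatever input the right conjunct receives, which the rule ignores in favor of $\scope(\texttt{p})$.
$$\evalmatcher{\texttt{p} \passToMatcher \textbf{Project}( \texttt{tgt}, \texttt{child} )}(\constant)(\scope) = \evalmatcher{\textbf{Project}( \texttt{tgt}, \texttt{child} )}(\scope(\texttt{p}))(\scope)$$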
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@ -109,7 +112,7 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi
|
|
|
|
|
\scope & \textbf{if } \expression(\scope) = \top\\
|
|
|
|
|
\nullresult & \textbf{otherwise}
|
|
|
|
|
\end{cases}\\
|
|
|
|
|
\evalmatcher{\var @ \matcher}(\constant)(\scope)
|
|
|
|
|
\evalmatcher{\var \passToMatcher \matcher}(\constant)(\scope)
|
|
|
|
|
& = \evalmatcher{\matcher}(\scope(\var))(\scope)\\
|
|
|
|
|
\evalmatcher{\var \leftarrow \expression}(\constant)(\scope)
|
|
|
|
|
& = \scope[\var \backslash \expression(\scope)]\\
|
|
|
|
@ -117,19 +120,19 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi
|
|
|
|
|
& = \evalmatcher{\matcher}(\constant)(\scope[\var \backslash \constant])
|
|
|
|
|
\end{align*}
|
|
|
|
|
\vspace*{-3mm}
|
|
|
|
|
\trimmedcaption{Operational semantics for match patterns.}
|
|
|
|
|
\trimmedcaption{Semantics for match patterns.}
|
|
|
|
|
\label{fig:evalMatcherSemantics}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
\begin{example}
|
|
|
|
|
\label{ex:sideEffects}
|
|
|
|
|
When \texttt{tgt} of a \textbf{Project} operator has side-effects or other non-deterministic behavior, systems like Spark will not apply the selection push-down optimization.
|
|
|
|
|
Let $\textsc{det}$ be an externally provided function that determines whether a target is deterministic.
|
|
|
|
|
When a \textbf{Project} operator has side effects or other non-deterministic behavior, systems like Spark will not apply the selection push-down optimization.
|
|
|
|
|
Let $\textsc{det}$ be an externally provided function that determines whether a \textbf{Project} is deterministic.
|
|
|
|
|
Recall that we do not explicitly define expression evaluation semantics, so let $\textsc{det}(\texttt{tgt})$ be an expression that applies $\textsc{det}$ to the variable \texttt{tgt} in the scope.
|
|
|
|
|
We can then write a `safe' version of the selection push-down pattern as:
|
|
|
|
|
$$\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{tgt}, \texttt{child})) \wedge \textsc{det}(\texttt{tgt})$$
|
|
|
|
|
If the left half of the conjunction succeeds, the variables \texttt{cond}, \texttt{tgt}, and \texttt{child} will be bound in the scope.
|
|
|
|
|
The right-half succeeds if \textsc{det} returns true on the value bound to \texttt{tgt}.
|
|
|
|
|
The right half succeeds if the expression $\textsc{det}(\texttt{tgt})$ evaluates to true on the resulting scope.
|
|
|
|
|
\label{eg:MatchPattern}
|
|
|
|
|
\end{example}
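
As a sketch, suppose this pattern is applied to a \textbf{Filter} node whose children are $\texttt{'X>3'}$ and $\textbf{Project}(\texttt{['X', 'Y']}, \textbf{Table}(\texttt{'R'}))$, which is the shape we assume here for the AST in \Cref{fig:exampleAST}, starting from a scope $\scope$.
The name $\scope'$ below is introduced only for this illustration.
If the left half of the conjunction matches, the right half is evaluated on the scope
$$\scope' = \scope[\texttt{cond} \backslash \texttt{'X>3'}][\texttt{tgt} \backslash \texttt{['X', 'Y']}][\texttt{child} \backslash \textbf{Table}(\texttt{'R'})]$$
and the expression rule of \Cref{fig:evalMatcherSemantics} then gives, for any input $\constant$,
$$\evalmatcher{\textsc{det}(\texttt{tgt})}(\constant)(\scope') = \begin{cases}
\scope' & \textbf{if } \textsc{det}(\texttt{tgt})(\scope') = \top\\
\nullresult & \textbf{otherwise}
\end{cases}$$
The outcome depends only on the value bound to \texttt{tgt} in $\scope'$, since the expression rule does not inspect $\constant$.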
|
|
|
|
|
|
|
|
|
@ -139,10 +142,10 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi
|
|
|
|
|
\subsection{Application of Match Patterns}
|
|
|
|
|
|
|
|
|
|
In general, we are interested in match patterns applied to entire ASTs.
|
|
|
|
|
Let $\db \in \constantdom$ denote an abstract syntax tree instance\footnote{
|
|
|
|
|
We note that while $\db$ may also be a constant, this case is uninteresting.
|
|
|
|
|
Let $\db \in \constantdom$ be an abstract syntax tree instance\footnote{
|
|
|
|
|
While $\db$ may also be a constant, this case is not usually interesting.
|
|
|
|
|
}.
|
|
|
|
|
Let the subtrees of $\db$ be defined as:
|
|
|
|
|
We define the subtrees of $\db$ as:
|
|
|
|
|
$$\subtreesOf{\db} = \inset{\db} \cup \begin{cases}
|
|
|
|
|
\bigcup_{i} \subtreesOf{\constant_i} & \textbf{if } \db = \nodelabel(\constant_1, \ldots, \constant_n) \\
|
|
|
|
|
\emptyset & \textbf{otherwise}
|
|
|
|
@ -150,7 +153,7 @@ $$\subtreesOf{\db} = \inset{\db} \cup \begin{cases}
|
|
|
|
|
\begin{example}
|
|
|
|
|
\label{ex:subtrees}
|
|
|
|
|
The subtrees of the AST in \Cref{fig:exampleAST} are $\textbf{Filter}(\ldots)$, $\texttt{'X>3'}$, $\textbf{Project}(\ldots)$, $\texttt{['X', 'Y']}$, $\textbf{Table}(\ldots)$, and $\texttt{'R'}$.
|
|
|
|
|
While this list includes everything, including primitive values, moving forward we will consider only AST nodes like $\textbf{Filter}(\ldots)$ as subtrees.
|
|
|
|
|
Note that this list includes everything, including primitive values.
|
|
|
|
|
\end{example}
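
As a sketch of how the definition unfolds, assume the AST in \Cref{fig:exampleAST} has the shape $\textbf{Filter}(\texttt{'X>3'}, P)$ with $P = \textbf{Project}(\texttt{['X', 'Y']}, T)$ and $T = \textbf{Table}(\texttt{'R'})$; the shorthands $P$ and $T$ are introduced here only for illustration.
\begin{align*}
\subtreesOf{T} & = \inset{T} \cup \subtreesOf{\texttt{'R'}} = \inset{T, \texttt{'R'}}\\
\subtreesOf{P} & = \inset{P} \cup \subtreesOf{\texttt{['X', 'Y']}} \cup \subtreesOf{T} = \inset{P, \texttt{['X', 'Y']}, T, \texttt{'R'}}\\
\subtreesOf{\textbf{Filter}(\texttt{'X>3'}, P)} & = \inset{\textbf{Filter}(\texttt{'X>3'}, P)} \cup \subtreesOf{\texttt{'X>3'}} \cup \subtreesOf{P}\\
& = \inset{\textbf{Filter}(\texttt{'X>3'}, P), \texttt{'X>3'}, P, \texttt{['X', 'Y']}, T, \texttt{'R'}}
\end{align*}
Each primitive value contributes only itself, because the \textbf{otherwise} case adds $\emptyset$.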
|
|
|
|
|
|
|
|
|
|
For a matcher $\matcher \in \matcherdom$, we define $\query{\matcher}(\db)$ as a search over every subtree of $\db$:
|
|
|
|
@ -216,7 +219,7 @@ $$\query{\matcher}(\db) = \comprehension{
|
|
|
|
|
|
|
|
|
|
\subsection{\systemlang}
|
|
|
|
|
|
|
|
|
|
We next demonstrate that any match pattern in the language defined above may be rewritten into a simplified ``flat'' form based on relational algebra that we call \systemlang.
|
|
|
|
|
We next demonstrate that any match pattern in the language defined above may be rewritten into a simplified ``flat'' form, based on relational algebra, that we call \systemlang.
|
|
|
|
|
Specifically, we extend positive relational algebra with a set of task-specific relational \textbf{match atoms} $\atom \in \atomdom$ as follows:
|
|
|
|
|
$$\atomdom :=
|
|
|
|
|
\inbrackets{\vardom = \nodelabel(\vardom, \ldots, \vardom)}
|
|
|
|
@ -226,10 +229,10 @@ $$\atomdom :=
|
|
|
|
|
\oroption \bot
|
|
|
|
|
\oroption \db(\vardom)$$
|
|
|
|
|
|
|
|
|
|
Formal semantics for match atoms are defined in \Cref{fig:atomSemantics}.
|
|
|
|
|
Semantics for match atoms are defined in \Cref{fig:atomSemantics}.
|
|
|
|
|
We note that several atom types have infinite cardinalities; we return to this point shortly.
|
|
|
|
|
To summarize,
|
|
|
|
|
(i)~a \textbf{Match Atom} ($\inbrackets{\var = \nodelabel(\var_1, \ldots, \var_n)}$) defines a (infinite) relation, with schema $\inset{\var, \var_1, \ldots, \var_n}$, consisting of every possible AST node, with the node in attribute $\var$, and the remaining attributes assigned to the node's fields.
|
|
|
|
|
(i)~a \textbf{Match Atom} ($\inbrackets{\var = \nodelabel(\var_1, \ldots, \var_n)}$) defines an (infinite) relation, with schema $\inset{\var, \var_1, \ldots, \var_n}$, consisting of every possible AST node, with the node in attribute $\var$, and the node's fields as the remaining attributes.
|
|
|
|
|
(ii)~a \textbf{Binding Atom} ($\inbrackets{\var = \expression}$) defines an (infinite) relation, with schema $\inset{\var} \cup \varsOf{\expression}$, of possible assignments to $\varsOf{\expression}$ and the result of evaluating $\expression$ on them;
|
|
|
|
|
(iii)~a \textbf{Test Atom} ($\inbrackets{\expression}$) defines the (infinite) relation, with schema $\varsOf{\expression}$, of all assignments to $\varsOf{\expression}$ on which $\expression$ evaluates to true;
|
|
|
|
|
(iv)~a \textbf{Universal Atom} ($\top$) defines the relation, with a nullary schema, consisting of a single tuple;
|
|
|
|
@ -258,7 +261,7 @@ We write $\schemaOf{\atom}$ for the schema of $\atom$.
|
|
|
|
|
\rewritematcher{\var}{\expression}
|
|
|
|
|
& = \inbrackets{\expression}\\
|
|
|
|
|
%
|
|
|
|
|
\rewritematcher{\var}{\var' @ \matcher}
|
|
|
|
|
\rewritematcher{\var}{\var' \passToMatcher \matcher}
|
|
|
|
|
& = \rewritematcher{\var'}{\matcher}\\
|
|
|
|
|
%
|
|
|
|
|
\rewritematcher{\var}{\var' \leftarrow \expression}
|
|
|
|
@ -267,28 +270,45 @@ We write $\schemaOf{\atom}$ for the schema of $\atom$.
|
|
|
|
|
\rewritematcher{\var}{\var' \leftarrow \matcher}
|
|
|
|
|
& = \rewritematcher{\var}{\matcher} \bowtie \inbrackets{\var' = \var}
|
|
|
|
|
\end{align*}
|
|
|
|
|
\trimmedcaption{Reducing match patterns to \systemlang; Each $\genvar$ denotes a freshly allocated variable name}
|
|
|
|
|
\trimmedcaption{Rewriting match patterns to \systemlang. Each $\genvar$ is a freshly allocated variable name.}
|
|
|
|
|
\label{fig:reductionToFOL}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
|
|
\paragraph{Query Safety}
|
|
|
|
|
|
|
|
|
|
While some match atoms are infinite-cardinality relations, we want queries to produce only finite outputs.
|
|
|
|
|
This concept is typically captured by the \emph{safety} property: a relation or query is safe (even if one of its component relations is unsafe) if it returns a finite set of results.
|
|
|
|
|
If a relation is finite, we know that its attributes have a finite domain and call the relation safe.
|
|
|
|
|
If an infinite relation is joined with a finite relation, the result is safe if the keys of the infinite relation participate in the join.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
|
|
\paragraph{Rewriting Match Patterns}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
We next address the problem of rewriting `match' style pattern queries into \systemlang.
|
|
|
|
|
\Cref{fig:reductionToFOL} defines the rewrite rule $\rewritematcher{\var}{\matcher}$, which translates a match pattern $\matcher \in \matcherdom$ into a join over match atoms.
|
|
|
|
|
With $\var$ as a unique variable name, $\query{\matcher}(\db) \equiv \db(\var) \bowtie \rewritematcher{\var}{\matcher}$.
|
|
|
|
|
We specifically target joins to benefit from the commutativity and associativity of the join operator when picking an execution strategy in \Cref{fig:executionPlan}.
|
|
|
|
|
With $\var$ as a unique variable name, $\query{\matcher}(\db) \equiv \db(\var) \bowtie \rewritematcher{\var}{\matcher}$.
|
|
|
|
|
The result is guaranteed to be safe.
|
|
|
|
|
|
|
|
|
|
\begin{example}
|
|
|
|
|
\label{ex:rewrite}
|
|
|
|
|
Continuing our running example, and recalling that a bare variable $\var$ is shorthand for $\var \leftarrow \matchany$:
|
|
|
|
|
Continuing our running example, and recalling that a bare variable $\var$ is shorthand for $\var \leftarrow \matchany$, we expand $\query{m}(\db) =$
|
|
|
|
|
{\footnotesize\begin{align*}
|
|
|
|
|
\query{m}(\db) & = \db(r) \bowtie \rewritematcher{r}{\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{tgt}, \texttt{child}))}\\
|
|
|
|
|
& = \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \rewritematcher{a}{\texttt{cond} \leftarrow \matchany} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
|
|
|
|
|
& = \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \top \bowtie \inbrackets{\texttt{cond} = a} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
|
|
|
|
|
&\; \db(r) \bowtie \rewritematcher{r}{\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{tgt}, \texttt{child}))}\\
|
|
|
|
|
= &\; \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \rewritematcher{a}{\texttt{cond} \leftarrow \matchany} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
|
|
|
|
|
= &\; \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \top \bowtie \inbrackets{\texttt{cond} = a} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
|
|
|
|
|
\end{align*}}\\[-8mm]
|
|
|
|
|
\textbf{Project} is expanded similarly to \textbf{Filter}.
|
|
|
|
|
The atom $\inbrackets{r = \textbf{Filter}(a, b)}$ has schema $\inset{r, a, b}$ and is defined for every triple where $r = \textbf{Filter}(a, b)$.
|
|
|
|
|
Because $r$ is a key for this relation, observe that the query $\db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)}$ computes the (finite) set of subtrees of $\db$ that are \textbf{Filter}-typed AST nodes (with attributes $r$, $a$, $b$ taking the values of the node and its two children, respectively).
|
|
|
|
|
\end{example}
|
|
|
|
|
|
|
|
|
|
Note the following relational algebra equivalences
|
|
|
|
|
\noindent Note the following relational algebra equivalences:
|
|
|
|
|
$$
|
|
|
|
|
Q \bowtie \top \equiv Q
|
|
|
|
|
\hspace{10mm} %%%%%%%%%%%%%%%%%%%%
|
|
|
|
@ -296,17 +316,5 @@ Q \bowtie \bot \equiv \bot
|
|
|
|
|
\hspace{10mm} %%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
\db(\var)\bowtie\db(\var) \equiv \db(\var)
|
|
|
|
|
$$
|
|
|
|
|
The first two equivalences follow from the relations $\top$ and $\bot$ being the identity and annihilator values for $\bowtie$ respectively~\cite{DBLP:conf/pods/GreenKT07}.
|
|
|
|
|
The first two equivalences follow from the relations $\top$ and $\bot$ being the identity and annihilator values for $\bowtie$, respectively.
|
|
|
|
|
The third follows from the idempotency of natural join on keyed relations.
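
To illustrate the first equivalence on the running example, suppose the \textbf{Project} sub-pattern of \Cref{ex:rewrite} is expanded in the same way as \textbf{Filter}, with fresh variables $c$ and $d$ (names chosen here for illustration; any unused names would do):
{\footnotesize\begin{align*}
\rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}
= &\; \inbrackets{b = \textbf{Project}(c, d)} \bowtie \top \bowtie \inbrackets{\texttt{tgt} = c} \bowtie \top \bowtie \inbrackets{\texttt{child} = d}
\end{align*}}
Dropping every $\top$ via $Q \bowtie \top \equiv Q$, the full query reduces to a join over atoms only:
{\footnotesize\begin{align*}
\db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \inbrackets{\texttt{cond} = a} \bowtie \inbrackets{b = \textbf{Project}(c, d)} \bowtie \inbrackets{\texttt{tgt} = c} \bowtie \inbrackets{\texttt{child} = d}
\end{align*}}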
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
|
|
|
|
|
\paragraph{Query Safety}
|
|
|
|
|
|
|
|
|
|
As we note above, match atoms are infinite-cardinality relations, but naturally, we want queries to produce only finite outputs.
|
|
|
|
|
This concept is typically captured by the notion of \emph{safety}: a query is safe if it is guaranteed to return a finite set of results.
|
|
|
|
|
This property is derived iteratively: If a relation is finite, we know that its attributes have a finite domain and call the attributes safe.
|
|
|
|
|
If all of the key attributes of a relation (even an infinite one) are safe, then only a finite number of records in the relation can possibly participate in a join and we can call the relation and all of its attributes safe.
|
|
|
|
|
A query is safe when all of its relations are safe.
|
|
|
|
|
The rewrite $\rewritematcher{\var}{\matcher}$ guarantees safety if: (i) $\var$ is safe, and (ii) any attributes referenced by expressions in $\matcher$ are safe.
|
|
|
|
|
|
|
|
|
|