stuff

2023-07-17 23:19:46 -04:00 · 2023-07-17 23:19:46 -04:00 · 577c6cdaa2
parent 5a23778161
commit 577c6cdaa2
3 changed files with 62 additions and 82 deletions
--- a/sections/introduction.tex
+++ b/sections/introduction.tex
@ -44,14 +44,12 @@ Match patterns are ubiquitous in functional languages due to their utility for p
 For example, the push-down rule above could be expressed in Scala using the \texttt{match} operator:

 \begin{lstlisting}
-  plan match {
-    case Filter(condition, Project(targetList, child)) =>
-      Project(targetList, Filter(condition, child))
-  }
+  plan match { case Filter(cond, Project(tgt, child)) =>
+                 Project(tgt, Filter(cond, child))          }
 \end{lstlisting}

 The pattern checks to see if \lstinline{plan} represents a \lstinline{Filter} ($\sigma$) node. 
-If so, it binds its first field to \lstinline{condition}, and checks to see if its second field is a \lstinline{Project} ($\pi$) node.  
+If so, it binds its first field to \lstinline{cond}, and checks to see if its second field is a \lstinline{Project} ($\pi$) node.  
 If so, it binds two child variables and runs the code to the right of \lstinline{=>} to build a replacement tree.

 A key design methodology for optimizers of relational query languages (and other paradigms) is to execute a large set of production rules simultaneously, iteratively inferring more optimal join plans until a fixed-point is reached (or time dictates we must give up). The key insights driving \systemlang are that: 
--- a/sections/query_evaluation.tex
+++ b/sections/query_evaluation.tex
@ -6,10 +6,9 @@ A typical optimizer is typically defined as a collection of rules of the form $\
  A typical optimizer has many `batches' of such rules, applied in sequence.
  As the generalization to multiple batches is straightforward, we assume only one batch.
 }.  
-To perform one rewrite step, the optimizer searches for the first subtree with a successful match (where $L_i(Q)$ is the limit operator):
-$$L_1\left(\bigcup_i \query{\matcher_i}(\db)\right)$$
+To perform one rewrite step, the optimizer searches for the first subtree with a successful match from $\bigcup_i \query{\matcher_i}(\db)$.
 Given a subtree $\constant$ matched by $\matcher_i$ and the resulting scope $\scope$, the step results in an updated AST $\db' = \db[\constant \backslash \expression_i(\scope)]$ (where $\db[\constant \backslash \constant']$ indicates $\db$ with subtree $\constant$ replaced by subtree $\constant'$).
-\DB{$\db[\constant \backslash \expression_i(\scope)]$ this supports match pattern that have an experssion evaluation right? Not all match patterns.}
+% \DB{$\db[\constant \backslash \expression_i(\scope)]$ this supports match pattern that have an experssion evaluation right? Not all match patterns.}
 The optimizer repeats this process until $\db$ converges to a fixed point, or a timeout or threshold number of steps is reached.  

 In this section, we focus on optimizing the search for matched subtrees, and return to the rewrite step in \Cref{sec:updates}. 
@ -34,20 +33,20 @@ Q \bowtie [\var = \nodelabel(\var_1, \ldots, \var_n)]
 We introduce the new operator $\expandop$ in the third equivalence shortly.
 The first two equivalences follow from the fact that $\varsOf{\expression}$ is a key for both $\inbrackets{e}$ and $\inbrackets{\var = \expression}$; the join is a foreign-key join.
 In the former case, tuples that fail the predicate have no join partner and are filtered out, like a selection.
-In the latter, the join is guaranteed to find a foreign key, and the resulting schema is extended by $\var$, like a projection.
-The conditions on the schema of $Q$ are enforce safety properties, as outlined above.
+In the latter, the join is guaranteed to find a foreign key, and the resulting schema is extended by $\var$, like projection.
+The constraints on $\schemaOf{Q}$ enforce query safety.

 Match atoms similarly act as a foreign key join, but simultaneously filter (like select) and extend the schema (like project): (i) only tuples where $\var$ is a $\nodelabel$-typed AST-node find join partners, but (ii) the resulting schema is extended by $\inset{\var_1, \ldots, \var_n}$.  
-We capture this behavior in a new operator named expand (denoted $\expandop$);
-The expand operator is reminiscent of the Unnest operator in nested relational algebra~\cite{DBLP:conf/pods/JaeschkeS82}; albeit constrained to emit at most one tuple per input.
+We capture this behavior in a new operator named \textbf{expand} (denoted $\expandop$).
+The expand operator is similar to the Unnest operator in nested relational algebra~\cite{DBLP:conf/pods/JaeschkeS82}, but never emits more than one tuple per input.

-\OK{Worth it to talk about merging variables? i.e.,:
-$$\inbrackets{\var = \var'} \bowtie Q \equiv \pi_{\schemaOf{Q}, \var \leftarrow \var'}(Q[\var \backslash \var'])$$
-}
+% \OK{Worth it to talk about merging variables? i.e.,:
+% $$\inbrackets{\var = \var'} \bowtie Q \equiv \pi_{\schemaOf{Q}, \var \leftarrow \var'}(Q[\var \backslash \var'])$$
+% }

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Evaluation Plans}
-\DB{Should we call this section differently? Evaluation Workflow or Rewrite Workflow? In my head its conflicting with the physical plan thats evaluated. Ignore if its just me.}
+\subsection{Rewrite Execution Plan}
+% \DB{Should we call this section differently? Evaluation Workflow or Rewrite Workflow? In my head its conflicting with the physical plan thats evaluated. Ignore if its just me.}
 Starting with a join over atomic relations allows us to explore the space of evaluation plans by leveraging the associativity and commutativity of join.
 \Cref{alg:makePlan} outlines a simple, greedy strategy for eliminating joins.  
 Specifically, given a query of the form: $$Q = \atom_1 \bowtie \ldots \bowtie \atom_n = \rewritematcher{\var}{\matcher}$$
@ -67,15 +66,12 @@ In summary, the key challenge of selecting an execution plan is (greedily) selec
    \Require $A = \inset{\atom_1, \ldots, \atom_n} \subseteq \atomdom$: A set of target atoms
    \Require $Q$: A query
    \Ensure $Q': $ A query equivalent to $Q \bowtie \atom_1 \bowtie \ldots \bowtie \atom_n$
-    \If{$|A| = 0$}
-      \State \Return Q
-    \Else
-      \State $\texttt{candidates} = \textsc{EnumerateCandidates}(Q, A)$
-      \State $\atom = \textsc{PickCandidate}(C)$
-      \State $A' = A - \inset{\atom}$; 
-        \hspace{5mm} $Q' = \textsc{Rewrite}(Q \bowtie \atom)$
-      \State \Return \textsc{MakePlan}$(A', Q')$
-    \EndIf
+    \State \textbf{if} {$|A| = 0$} \textbf{then} \Return $Q$
+    \State $C = \textsc{EnumerateCandidates}(Q, A)$
+    \State $\atom = \textsc{PickCandidate}(C)$
+    \State $A' = A - \inset{\atom}$; 
+      \hspace{5mm} $Q' = \textsc{Rewrite}(Q \bowtie \atom)$
+    \State \Return \textsc{MakePlan}$(A', Q')$
  \end{algorithmic}
 \end{algorithm}

@ -101,23 +97,20 @@ However, the same rewritten subquery is shared by all target atom sets that shar
    \Require $Q$: A query
    \Require $\vec A$: A vector of target atom sets $A_i$ as in \Cref{alg:makePlan}.
    \Ensure $\vec Q'$: A vector of queries, where $Q_i'$ is as in \Cref{alg:makePlan}.
-    \If{$|\vec A| = 0$}
-      \State \Return $[]$
-    \Else
-      \For{$A_i \in \vec A$}
-        \State $C_i = \textsc{EnumerateCandidates}(Q, A_i)$
-      \EndFor
-      \State $\atom = \textsc{PickCandidate}(\vec C)$
-      \State $Q' = \textsc{Rewrite}(Q \bowtie \atom)$
-      \State \Return $
-              \textsc{MakeSharedPlan}(Q', \inbrackets{\;
-                A_i \;\left|\; A_i \in \vec A \wedge \atom \in A_i\right.
-              })$ \\
-            \hspace{15mm} $\cup \;
-              \textsc{MakeSharedPlan}(Q, \inbrackets{\;
-                A_i \;\left|\; A_i \in \vec A \wedge \atom \not \in A_i\right.
-              })$
-    \EndIf
+    \State \textbf{if } {$|\vec A| = 0$} \textbf{then} \Return $[]$
+    \For{$A_i \in \vec A$}
+      \State $C_i = \textsc{EnumerateCandidates}(Q, A_i)$
+    \EndFor
+    \State $\atom = \textsc{PickCandidate}(\vec C)$
+    \State $Q' = \textsc{Rewrite}(Q \bowtie \atom)$
+    \State \Return $
+            \textsc{MakeSharedPlan}(Q', \inbrackets{\;
+              A_i \;\left|\; A_i \in \vec A \wedge \atom \in A_i\right.
+            })$ \\
+          \hspace{15mm} $\cup \;
+            \textsc{MakeSharedPlan}(Q, \inbrackets{\;
+              A_i \;\left|\; A_i \in \vec A \wedge \atom \not \in A_i\right.
+            })$
  \end{algorithmic}
 \end{algorithm}

@ -190,7 +183,7 @@ Having established a language for querying the subtrees of $\db$, we next turn t
 The optimizer interacts with $\db$ through three access patterns:
 (i) Full subtree scan ($\db(\var)$),
 (ii) Node expansion ($\expandop_{\var \rightarrow \ell(\var_1, \ldots, \var_n)}$), and
-(iii) Subtree replacement $\db' = \db[\constant \backslash \constant]$
+(iii) Subtree replacement $\db' = \db[\constant \backslash \constant']$.

 In \cite{balakrishnan:2021:sigmod:treetoaster}, we contrasted two different storage layouts for ASTs.
 In one, each node of a subtree was stored as a tuple in a relation, with one relation per AST node type (i.e., $\nodelabel$).
@ -203,13 +196,4 @@ AST nodes are stored as a tuple of their type and the fields, with fields contai

 Enumerating the subtrees is trivial on this structure, requiring only a series of pointer traversals.
 Node expansion is likewise viable on this naive representation, as variables holding AST nodes necessarily store references to the node and its fields on the heap.
-\DB{Should we add something about how these traversal actually look for joins between diffrent relations and by maintaining pointers we just materialize and maintain the join result which gives us performance boost and is not too much to store?}
-
-Subtree replacement (i.e., updates) poses two nuanced problems. 
-First, a reference to the AST node on the heap is insufficient to replace the node; references to the node from elsewhere in the tree must be redirected.
-We maintain a table of parent/child relationships indexed on the child, allowing us to quickly identify all references to a node.
-
-Second, in a typical compiler, ASTs are immutable structures.  
-While this makes them easier to reason about, it also precludes updating a single node without updating each of its ancestors.
-Per our prior work on Fluid Data Structures~\cite{balakrishnan:2019:dbpl:fluid}, our prototype implementation uses an indirection layer.
-Prior to optimization, the Logical Plan structure is rewritten so that each LogicalPlan node is indirectly referenced by a unary wrapper node that allows its child to be updated.
+Subtree replacement requires only maintaining a lookup table of parent nodes, and in the case of Spark, an indirection layer~\cite{balakrishnan:2019:dbpl:fluid} to work around immutability constraints.
--- a/sections/specification.tex
+++ b/sections/specification.tex
@ -55,8 +55,8 @@ Where it is clear to do so, for a variable $\var \in \vardom$, we overload $\var

 \begin{example}
  \label{ex:pattern}
-  With $\inset{\texttt{cond}, \texttt{target}, \texttt{child}} \subset \vardom$, the match pattern for the select pushdown optimization from the introduction is:
-  $\textbf{Filter}( \texttt{cond} , \textbf{Project}( \texttt{target}, \texttt{child} ))$.  Note the similarity to the match pattern in the introduction.
+  With $\inset{\texttt{cond}, \texttt{tgt}, \texttt{child}} \subset \vardom$, the match pattern for the select pushdown optimization from the introduction is:
+  $\textbf{Filter}( \texttt{cond} , \textbf{Project}( \texttt{tgt}, \texttt{child} ))$.  Note the similarity to the match pattern in the introduction.
 \end{example}


@ -74,7 +74,7 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi

 \begin{example}
  The pattern from our running example can be equivalently stated as:
-  $$\textbf{Filter}( \texttt{cond}, \texttt{p} ) \wedge \texttt{p} \passToMatcher \textbf{Project}( \texttt{target}, \texttt{child} )$$
+  $$\textbf{Filter}( \texttt{cond}, \texttt{p} ) \wedge \texttt{p} \passToMatcher \textbf{Project}( \texttt{tgt}, \texttt{child} )$$
  If the \textbf{Filter} node is not matched, the conjunction shortcuts.  
  If it is matched, the right half of the conjunction is evaluated with \texttt{cond} and \texttt{p} bound (in the scope) to the \textbf{Filter} node's children.
  The $@$ operator evaluates the \textbf{Project} matcher on the constant bound to the variable \texttt{p} (i.e., $\scope(\texttt{p})$), and the rest proceeds as before.
@ -160,8 +160,8 @@ $$\query{\matcher}(\db) = \comprehension{
 }$$

 \begin{example}
-  The running example pattern of \Cref{ex:pattern} applied to the subtrees of \Cref{ex:subtrees} produces only a single match with scope:
-  $$\inset{\;\texttt{cond} \mapsto \texttt{'X>3'},\;\; \texttt{target} \mapsto \texttt{['X','Y']},\;\; \texttt{child} \mapsto \textbf{Table}(\ldots)\;}$$
+  Let $\matcher$ be the running example pattern of \Cref{ex:pattern} and $\db$ be the query of \Cref{fig:exampleAST}.  The query $\query{\matcher}(\db)$ produces only a single result, the scope:
+  $$\inset{\;\texttt{cond} \mapsto \texttt{'X>3'},\;\; \texttt{tgt} \mapsto \texttt{['X','Y']},\;\; \texttt{child} \mapsto \textbf{Table}(\ldots)\;}$$
 \end{example}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -262,7 +262,7 @@ We write $\schemaOf{\atom}$ for the schema of $\atom$.
      & =  \inbrackets{\var' = \expression}\\
    %
    \rewritematcher{\var}{\var' \leftarrow \matcher}
-      & =  \rewritematcher{\var'}{\matcher} \bowtie \inbrackets{\var = \var'}
+      & =  \rewritematcher{\var}{\matcher} \bowtie \inbrackets{\var' = \var}
  \end{align*}
  \caption{Reducing match patterns to \systemlang; Each $\genvar$ denotes a freshly allocated variable name}
  \label{fig:reductionToFOL}
@ -272,18 +272,18 @@ We next address the problem of rewriting `match' style pattern queries into \sys
 \Cref{fig:reductionToFOL} defines the rewrite rule $\rewritematcher{\var}{\matcher}$ which translates a match pattern $\matcher \in \matcherdom$ into a join over match atoms.  
 With $\var$ as a unique variable name, $\query{\matcher}(\db) \equiv \db(\var) \bowtie \rewritematcher{\var}{\matcher}$.

-
-
-shows how to rewrite match patterns into equivalent \systemlang queries.
-
-
-
-Concretely, we 
-
-Any matcher $\matcher \in \matcherdom$ can be rewritten into an equivalent relational algebra query using the
-
-Each result tuple of the query $\db(\var) \bowtie \rewritematcher{\var}{\matcher}$ defines one matched scope of $\query{\matcher}(\db)$, with the attribute $\var$ assigned to the matched subtree. 
-
+\begin{example}
+  \label{ex:rewrite}
+  Continuing our running example, and recalling that a bare variable $\var$ is shorthand for $\var \leftarrow \matchany$:
+  {\footnotesize\begin{align*}
+    \query{\matcher}(\db) & = \db(r) \bowtie \rewritematcher{r}{\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{tgt}, \texttt{child}))}\\
+                 & = \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \rewritematcher{a}{\texttt{cond} \leftarrow \matchany} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
+                 & = \db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)} \bowtie \top \bowtie \inbrackets{\texttt{cond} = a} \bowtie \rewritematcher{b}{\textbf{Project}(\texttt{tgt}, \texttt{child})}\\
+  \end{align*}}\\[-8mm]
+  \textbf{Project} is expanded similarly to \textbf{Filter}.  
+  The atom $\inbrackets{r = \textbf{Filter}(a, b)}$ has schema $\inset{r, a, b}$ and is defined for every triple where $r = \textbf{Filter}(a, b)$.  
+  Because $r$ is a key for this relation, observe that the query $\db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)}$ computes the (finite) set of subtrees of $\db$ that are \textbf{Filter}-typed AST nodes (with attributes $r$, $a$, $b$ taking the values of the node and its two children, respectively).
+\end{example}

 Note the following relational algebra equivalences:
 $$ 
@ -294,18 +294,16 @@ Q \bowtie \bot \equiv \bot
 \db(\var)\bowtie\db(\var) \equiv \db(\var)
 $$
 The first two equivalences follow from the relations $\top$ and $\bot$ being the identity and annihilator values for $\bowtie$ respectively~\cite{DBLP:conf/pods/GreenKT07}.
-The third follows from the idempotency of natural join on relations without duplicates.
+The third follows from the idempotency of natural join on keyed relations.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-\subsection{Query Safety}
+\paragraph{Query Safety}

-Although we model Expression Bindings and Expression Tests as infinite relations, these relations may have infinite cardinality. \DB{should this be finite cardinality?}Put another way, all variables referenced by an expression must be assigned values before the expression can be evaluated.
-We formalize this by defining a safety property:
-
-\OK{
-... standard bla bla here.  the only 'safe' expressions are $\top$, $\bot$, and $\db(\var)$.  Everything else needs some symbol bound for safety.  The entire query needs to be safe, so the rewritten query needs a $\db(\var)$ in it somewhere to be safe...
-
-... also maybe note that we can inject default bindings (e.g., $\inbrackets{v \leftarrow \texttt{null}}$) to `balance' variables bound on one side of a disjunction, but not the other.
-}
+As noted above, match atoms are infinite-cardinality relations, but queries must produce only finite outputs.
+This requirement is typically captured by the notion of \emph{safety}: a query is safe if it is guaranteed to return a finite set of results.
+Safety is derived iteratively: if a relation is finite, its attributes have a finite domain, and we call those attributes safe.
+If all of the key attributes of a relation (even an infinite one) are safe, then only a finite number of records in the relation can participate in a join, and we call the relation and all of its attributes safe.
+A query is safe when all of its relations are safe.
+The rewrite $\rewritematcher{\var}{\matcher}$ guarantees safety if: (i) $\var$ is safe, and (ii) any attributes referenced by expressions in $\matcher$ are safe.
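
The iterative safety derivation described in the Query Safety paragraph can be sketched as a small fixed-point computation. This is only an illustration, not the paper's implementation: the `Atom` case class, its fields, and the `Safety` object are hypothetical names introduced here, and each relation is modeled by nothing more than its schema, its key attributes, and a flag saying whether it is finite.

```scala
// Hypothetical model of a relation in the join (not taken from the paper):
// its schema, its key attributes, and whether it has finite cardinality.
final case class Atom(name: String, schema: Set[String], key: Set[String], finite: Boolean)

object Safety {
  // Iterate to a fixed point. A finite atom is safe outright; an infinite atom
  // becomes safe once every one of its key attributes is already safe.
  def safeAtoms(atoms: Set[Atom]): Set[Atom] = {
    @annotation.tailrec
    def loop(safe: Set[Atom]): Set[Atom] = {
      // Every attribute of a safe atom is itself safe.
      val safeAttrs = safe.flatMap(_.schema)
      val next = atoms.filter(a => a.finite || a.key.subsetOf(safeAttrs))
      if (next == safe) safe else loop(next)
    }
    loop(Set.empty)
  }

  // A query (here: a set of joined atoms) is safe when all of its atoms are safe.
  def isSafe(atoms: Set[Atom]): Boolean = safeAtoms(atoms) == atoms
}
```

For the running example, modeling `D(r)` as a finite atom keyed on `r`, `[r = Filter(a, b)]` as an infinite atom keyed on `r`, and `[cond = a]` as an infinite atom keyed on `a` makes `Safety.isSafe` return `true`; dropping `D(r)` leaves no finite relation to seed the derivation, so the same call returns `false`.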