conclusions, more examples, and trimming cites.

main
Oliver Kennedy 2023-07-18 00:30:19 -04:00
parent 6a14b08b81
commit 68cee7c1ff
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
5 changed files with 85 additions and 51 deletions

@ -44,17 +44,13 @@ author = {Maranget, Luc},
title = {Compiling Pattern Matching to Good Decision Trees},
year = {2008},
isbn = {9781605580623},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1411304.1411311},
publisher = {ACM},
doi = {10.1145/1411304.1411311},
abstract = {We address the issue of compiling ML pattern matching to compact and efficient decisions trees. Traditionally, compilation to decision trees is optimized by (1) implementing decision trees as dags with maximal sharing; (2) guiding a simple compiler with heuristics. We first design new heuristics that are inspired by necessity, a concept from lazy pattern matching that we rephrase in terms of decision tree semantics. Thereby, we simplify previous semantic frameworks and demonstrate a straightforward connection between necessity and decision tree runtime efficiency. We complete our study by experiments, showing that optimizing compilation to decision trees is competitive with the optimizing match compiler of Le Fessant and Maranget (2001).},
booktitle = {Proceedings of the 2008 ACM SIGPLAN Workshop on ML},
booktitle = {SIGPLAN-ML},
pages = {35--46},
numpages = {12},
keywords = {heuristics, match compilers, decision trees},
location = {Victoria, BC, Canada},
series = {ML '08}
}
@inproceedings{Keep:2013,
@ -115,7 +111,7 @@ series = {ICFP '13}
Jeff Kramer and
Jeff Magee},
title = {Scalable, adaptive load sharing for distributed systems},
journal = {{IEEE} Parallel Distributed Technol. Syst. Appl.},
journal = {{IEEE} PDTSA.},
volume = {1},
number = {3},
pages = {62--70},

@ -1,13 +1,17 @@
%!TEX root=../main.tex
\section{Conclusions and Future Work}
\label{sec:conclusions}
\begin{itemize}
\item Recursion / Properties of subtrees
\item Generalization to DAGs (dataflow/control flow compilers)
\item bringing joins back into query planning (e.g., materialize TWO subtrees and join the results together instead of post-processing one)
\item negations (e.g., makeSharedplan with three branches: one where the atom is mutually exclusive, one where the atom is not exclusive... )
\item ???
\end{itemize}
In this paper, we introduced \systemlang, a language for building declarative compilers, and a work-sharing optimization that it enables.
As members of both the DB and PL communities, we believe that \systemlang opens a new avenue of research for each group: new opportunities for large-scale program analysis and optimization on one side, and new data management patterns and system requirements on the other.
% \begin{itemize}
% \item Recursion / Properties of subtrees
% \item Generalization to DAGs (dataflow/control flow compilers)
% \item bringing joins back into query planning (e.g., materialize TWO subtrees and join the results together instead of post-processing one)
% \item negations (e.g., makeSharedplan with three branches: one where the atom is mutually exclusive, one where the atom is not exclusive... )
% \item ???
% \end{itemize}
% Ideas
% - SIGMOD '21
% - BDD optimization

@ -39,7 +39,7 @@ viewed as a tree transformation.
\paragraph{Match Patterns}
Production rules are often implemented via `match patterns' in
functional languages, or via analogous constructs in imperative languages~\cite{DBLP:conf/sigmod/SolimanAREGSCGRPWNKB14}.
functional languages, or via analogous constructs in imperative languages.
Match patterns are ubiquitous in functional languages due to their utility for programming with algebraic data~\cite{Luc:2008}.
For example, the push-down rule above could be expressed in Scala using the \texttt{match} operator:

@ -26,7 +26,7 @@ Q \bowtie \inbrackets{\var = \expression}
&& \textbf{and } \var \not\in \schemaOf{Q}\\
%%%%%%%%%%%%%%%%%%%%
Q \bowtie [\var = \nodelabel(\var_1, \ldots, \var_n)]
& \equiv \expandop_{\var \leftarrow \nodelabel(\var_1, \ldots, \var_n)}(Q)
& \equiv \expandop_{\var \rightarrow \nodelabel(\var_1, \ldots, \var_n)}(Q)
& \textbf{if } \inset{\var, \var_1, \ldots, \var_n} \hspace{7mm}\\
&&\cap\; \schemaOf{Q} = \inset{\var}
\end{align*}
@ -40,17 +40,43 @@ Match atoms similarly act as a foreign key join, but simultaneously filter (like
We capture this behavior in a new operator named \textbf{expand} (denoted $\expandop$).
The expand operator is similar to the Unnest operator in nested relational algebra~\cite{DBLP:conf/pods/JaeschkeS82}, but never emits more than one tuple per input.
% \OK{Worth it to talk about merging variables? i.e.,:
% $$\inbrackets{\var = \var'} \bowtie Q \equiv \pi_{\schemaOf{Q}, \var \leftarrow \var'}(Q[\var \backslash \var'])$$
% }
\begin{figure}
\begin{tikzpicture}[node distance=3mm]
\node (source) {$\db(r)$};
\node (filter) [above=of source] {$\expandop_{r\rightarrow{\textbf{Filter}(\texttt{cond}, b)}}$};
\node (project) [above left=of filter] {$\expandop_{b\rightarrow{\textbf{Project}(\texttt{tgt}, \texttt{child})}}$};
\node (check1) [above=of project] {$\sigma_{\textsc{det}(\texttt{tgt})}$};
\node (check2) [above=of filter] {$\sigma_{\texttt{cond} = true}$};
\node (join) [above right=of filter] {$\expandop_{b\rightarrow \textbf{Join}(c, d)}$};
\node (exp) [above=of join] {$\pi_{*, v \gets \textsc{refs}(\texttt{cond})}$};
\node (leftj) [above left=of exp] {$\sigma_{\textsc{outputs}(c) \subseteq v}$};
\node (rightj) [above=of exp] {$\sigma_{\textsc{outputs}(d) \subseteq v}$};
\draw[very thick,draw=red!50!black] (source) -> (filter);
\draw[very thick,draw=red!50!black] (filter) -> (project);
\draw[very thick,draw=red!50!black] (project) -> (check1);
\draw (filter) -> (check2);
\draw (filter) -> (join);
\draw (join) -> (exp);
\draw (exp) -> (leftj);
\draw (exp) -> (rightj);
\end{tikzpicture}
\caption{Example rewrite execution plan. Thicker, red lines are for the running example.}
\label{fig:executionPlan}
\end{figure}
\begin{example}
Continuing \Cref{ex:rewrite}, $\db(r) \bowtie \inbrackets{r = \textbf{Filter}(a, b)}$ meets the constraints of the first join elimination rewrite and can be rewritten to
$\expandop_{r\rightarrow\textbf{Filter}(a, b)}(\db(r))$.
As before, one result is produced for every \textbf{Filter}-typed AST subtree of $\db$.
\end{example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Rewrite Execution Plan}
\subsection{Selecting an Execution Plan}
% \DB{Should we call this section differently? Evaluation Workflow or Rewrite Workflow? In my head its conflicting with the physical plan thats evaluated. Ignore if its just me.}
Starting with a join over atomic relations allows us to explore the space of evaluation plans by leveraging the associativity and commutativity of join.
\Cref{alg:makePlan} outlines a simple, greedy strategy for eliminating joins.
Specifically, given a query of the form: $$Q = \atom_1 \bowtie \ldots \bowtie \atom_n = \rewritematcher{\var}{\matcher}$$
The return value of $\textsc{MakePlan}(\inset{\atom_1, \ldots, \atom_n}, \db(\var))$ is equivalent to $\query{\matcher}(\db)$\footnote{
Specifically, given a query of the form: $\atom_1 \bowtie \ldots \bowtie \atom_n = \rewritematcher{\var}{\matcher}$, the return value of $\textsc{MakePlan}(\inset{\atom_1, \ldots, \atom_n}, \db(\var))$ is equivalent to $\query{\matcher}(\db)$\footnote{
Matchers with disjunctions rewrite to queries with unions. A more robust optimization strategy is likely possible, but for the purposes of this paper we rely on distributivity to produce a set of union-free queries, each of which is passed individually to \textsc{MakePlan}.
}
$\textsc{MakePlan}$ proceeds in three steps:
@ -68,10 +94,8 @@ In summary, the key challenge of selecting an execution plan is (greedily) selec
\Ensure $Q': $ A query equivalent to $Q \bowtie \atom_1 \bowtie \ldots \bowtie \atom_n$
\State \textbf{if} {$|A| = 0$} \textbf{then} \Return $Q$
\State $\texttt{candidates} = \textsc{EnumerateCandidates}(Q, A)$
\State $\atom = \textsc{PickCandidate}(C)$
\State $A' = A - \inset{\atom}$;
\hspace{5mm} $Q' = \textsc{Rewrite}(Q \bowtie \atom)$
\State \Return \textsc{MakePlan}$(A', Q')$
\State $\atom = \textsc{PickCandidate}(\texttt{candidates})$
\State \Return \textsc{MakePlan}$(\;A - \inset{\atom},\; \textsc{Rewrite}(Q \bowtie \atom)\;)$
\end{algorithmic}
\end{algorithm}
@ -79,16 +103,20 @@ The optimal strategy for \textsc{PickCandidate} mirrors the selection predicate
We want to prioritize atoms that have a low selectivity and a low cost.
We leave a more thorough cost-based optimizer to future work, and for now adopt the following heuristic order over atoms:
(i) match atoms (which are inexpensive), (ii) test atoms (which have nonzero selectivity), (iii) binding atoms (which are neither).
We note that universal and empty atoms may be optimized away, and AST atoms only appear at the query root.
% We note that universal and empty atoms may be optimized away, and AST atoms only appear at the query root.
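The greedy strategy and heuristic order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the \texttt{Atom} record, its \texttt{needs} and \texttt{binds} fields, and the rule that an atom becomes a candidate once its required variables are in the schema are all hypothetical stand-ins for \textsc{EnumerateCandidates} and \textsc{Rewrite}.

```python
from dataclasses import dataclass

# Heuristic order from the text: match atoms, then test atoms, then binding atoms.
PRIORITY = {"match": 0, "test": 1, "binding": 2}


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record; the field names are assumptions."""
    kind: str          # "match", "test", or "binding"
    name: str          # label used only to display the plan
    needs: frozenset   # variables that must already be in the plan's schema
    binds: frozenset   # variables this atom adds to the schema


def enumerate_candidates(schema, atoms):
    """Assumed candidate rule: every required variable is already bound."""
    return [a for a in atoms if a.needs <= schema]


def pick_candidate(candidates):
    """Pick an atom from the highest-priority group that is present."""
    return min(candidates, key=lambda a: PRIORITY[a.kind])


def make_plan(atoms, schema, plan=()):
    """Greedily fold one atom into the plan per recursive call."""
    if not atoms:
        return list(plan)
    candidates = enumerate_candidates(schema, atoms)
    if not candidates:
        raise ValueError("no atom can be applied to the current schema")
    atom = pick_candidate(candidates)
    return make_plan(atoms - {atom}, schema | atom.binds, plan + (atom.name,))
```

With atoms modelled on the running example (a \textbf{Filter} match needing $r$, a \textbf{Project} match needing $b$, and a $\textsc{det}$ test needing \texttt{tgt}), each call sees exactly one candidate, so the sketch applies them in the order Filter, Project, det.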
\begin{example}
Consider the augmented version of our example pattern from \Cref{ex:sideEffects}.
Schema constraints enforce a specific execution plan --- \textsc{EnumerateCandidates} only returns a single candidate for each call.
The output of \textsc{MakePlan} on this pattern is illustrated in \Cref{fig:executionPlan} (bold, red lines).
\end{example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Merging Plans}
An optimizer is simultaneously interested in matching multiple patterns, not just one.
We find an appropriate optimization opportunity in stream processing systems (e.g., Aurora/Borealis\cite{DBLP:conf/cidr/AbadiABCCHLMRRTXZ05,DBLP:books/sp/16/CetintemelAABBCHMMRRSTXZ16}), where multiple simultaneous streams are rewritten to share overlapping computations~\cite{DBLP:journals/ieeecc/KremienKM93}.
\Cref{alg:makeSharedPlan} generalizes \Cref{alg:makePlan} to detect and leverage such opportunities.
As before, the algorithm greedily selects one atom at a time and rewrites the plan around it.
However, the same rewritten subquery is shared by all target atom sets that share the selected atom.
An optimizer is simultaneously interested in matching multiple patterns.
We find an appropriate optimization opportunity in stream processing systems (e.g., Aurora/Borealis~\cite{DBLP:conf/cidr/AbadiABCCHLMRRTXZ05}), where multiple simultaneous streams are rewritten to share overlapping computations~\cite{DBLP:journals/ieeecc/KremienKM93}.
\Cref{alg:makeSharedPlan} generalizes \Cref{alg:makePlan} to detect and leverage such opportunities, rewriting multiple atom sets in parallel.
\begin{algorithm}
\caption{$\textsc{MakeSharedPlan}(Q, \vec A)$}
@ -114,18 +142,24 @@ However, the same rewritten subquery is shared by all target atom sets that shar
\end{algorithmic}
\end{algorithm}
As before, the key challenge is the implementation of $\textsc{PickCandidate}$, which \Cref{alg:makeSharedPlan} generalizes to take a vector of candidate sets.
Similarly, we leave a full implementation leveraging cost-based optimization to future work and adopt a simple heuristic:
Candidate atoms are bucketed into three priority groups, as before, with each group taking precedence over subsequent ones.
Within a group, the atom that appears in the most candidate sets is selected.
As before, the key challenge is abstracted by $\textsc{PickCandidate}$, which now selects from a vector of candidate sets.
Our preliminary implementation buckets candidate atoms into the same three priority groups, and selects the atom that appears most frequently in the highest-priority bucket.
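One possible reading of this heuristic is sketched below in Python. The representation of atoms as \texttt{(kind, name)} pairs, the interpretation of `highest priority bucket' as the highest-priority non-empty bucket, and the tie-break by name are assumptions made for illustration only.

```python
from collections import Counter

# Same assumed priority order as for a single pattern.
PRIORITY = {"match": 0, "test": 1, "binding": 2}


def pick_shared_candidate(candidate_sets):
    """Pick one atom from a vector of candidate sets.

    candidate_sets: one set of (kind, name) pairs per pattern being matched.
    Returns the most frequent atom in the highest-priority non-empty bucket,
    or None when no pattern has a candidate.
    """
    counts = Counter(atom for cands in candidate_sets for atom in cands)
    if not counts:
        return None
    best_bucket = min(PRIORITY[kind] for kind, _ in counts)
    in_bucket = [a for a in counts if PRIORITY[a[0]] == best_bucket]
    # Most frequent first; ties are broken by name to keep the choice deterministic.
    return min(in_bucket, key=lambda a: (-counts[a], a[1]))
```

Under this reading, a match atom that appears in a single candidate set still wins over a test atom shared by every set, because the bucket is chosen before frequency is considered.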
\begin{example}
Consider the following additional match patterns
{\footnotesize \begin{align*}
&\textbf{Filter}(\texttt{cond}, child) \wedge (\texttt{cond} = true)\\
&\textbf{Filter}(\texttt{cond}, \textbf{Join}(lhs, rhs)) \wedge (v \leftarrow \textsc{refs}(\texttt{cond})) \wedge (\textsc{outputs}(lhs) \subseteq v)\\
&\textbf{Filter}(\texttt{cond}, \textbf{Join}(lhs, rhs)) \wedge (v \leftarrow \textsc{refs}(\texttt{cond})) \wedge (\textsc{outputs}(rhs) \subseteq v)
\end{align*}}
These test, respectively, for tautological filters and for joins that admit filter push-down into their left or right input.
The rewrite into \systemlang and the subsequent variable-elimination optimizations are left as an exercise for the reader; \Cref{fig:executionPlan} shows the final work-shared plan.
\end{example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Limit-1 Execution}
Naively evaluating the query entails computing the union of all match queries in full.
However, in its simplest form, an optimizer is only interested in one rewrite opportunity at a time --- more than this risks rewriting the same part of the tree in different ways.
We consider how this limitation may be averted later in \Cref{TODO}, but for now are interested in producing only a single match.
Naively evaluating the query entails computing the union of all match queries in full, while we only want one tuple at a time.
Volcano-style query execution is limited in its ability to share work from common subplans without expensive materialization of intermediate results.
Meanwhile, `push'-based, streaming-style query execution requires asynchronicity.
Here, we develop a non-asynchronous push-based evaluation strategy that can efficiently shortcut execution once a single result is produced.
@ -139,14 +173,13 @@ We write $\parentsOf{Q}$ to denote the set of subplans in the DAG with $Q$ as a
\begin{algorithmic}
\Require $Q$: A query subplan
\Require $t$: A tuple
\Ensure A matched tuple or $\nullresult$ \Comment{Match found, terminate}
\Ensure A matched tuple or $\nullresult$
\State \textbf{if} $\parentsOf{Q} = \emptyset$ \textbf{then return} $t$
\Comment{Match found, terminate}
\For{$Q' \in \parentsOf{Q}$}
\State $t' \gets \textsc{PushPlan}(Q', t)$
\If{$t' \neq \nullresult$}
\State \Return $t'$
\EndIf
\EndFor
\State \textbf{if} {$t' \neq \nullresult$} \textbf{then} \Return $t'$
\EndFor
\State \Return $\nullresult$
\end{algorithmic}
\end{algorithm}
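The short-circuiting behavior of \textsc{PushPlan} can be illustrated with a small Python sketch. The excerpt above does not show how each subplan transforms or filters the tuple it receives, so the \texttt{Node} class and its \texttt{apply} callback are hypothetical: the sketch simply assumes that every operator maps an input tuple to an output tuple or to \texttt{None}.

```python
class Node:
    """One operator in the shared plan DAG (hypothetical representation)."""

    def __init__(self, name, apply=lambda t: t):
        self.name = name
        self.apply = apply   # returns a transformed tuple, or None to drop it
        self.parents = []    # operators that consume this node's output


def push_plan(node, t):
    """Push t upward from node; return the first tuple that reaches a root."""
    if not node.parents:
        return t                       # match found, terminate
    for parent in node.parents:
        out = parent.apply(t)
        if out is None:
            continue                   # tuple rejected on this branch
        result = push_plan(parent, out)
        if result is not None:
            return result              # skip all remaining branches
    return None
```

Because the traversal is an ordinary depth-first recursion, no operator runs asynchronously, and once one branch yields a match the remaining parents are never evaluated.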

@ -14,7 +14,7 @@ We model scopes $\scope \in \scopedom$ as maps of variable bindings ($\scopedom:
Unbound variables in a scope return an undefined, \texttt{null} value.
\begin{figure}
\begin{tikzpicture}[node distance=5mm]
\begin{tikzpicture}[node distance=3mm]
\node (filter) [plannode] {Filter};
\node (cond) [field,below left=of filter] {'$X > 3$'};
\node (project) [plannode,below right=of filter] {Project};
@ -121,13 +121,14 @@ We mark scope updates by $\scope[\var \backslash \constant]$ to mean $\scope$ wi
\end{figure}
\begin{example}
When \texttt{target} of a \textbf{Project} operator has side-effects or other non-deterministic behavior, systems like Spark will not apply the selection push-down optimization.
\label{ex:sideEffects}
When \texttt{tgt} of a \textbf{Project} operator has side-effects or other non-deterministic behavior, systems like Spark will not apply the selection push-down optimization.
Let $\textsc{det}$ be an externally provided function that determines whether a target is deterministic.
Recall that we do not explicitly define expression evaluation semantics, so let $\textsc{det}(\texttt{target})$ be an expression that applies $\textsc{det}$ to the variable \texttt{target} in the scope.
Recall that we do not explicitly define expression evaluation semantics, so let $\textsc{det}(\texttt{tgt})$ be an expression that applies $\textsc{det}$ to the variable \texttt{tgt} in the scope.
We can then write a `safe' version of the selection push-down pattern as:
$$\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{target}, \texttt{child})) \wedge \textsc{det}(\texttt{target})$$
If the left-half of the conjunction succeeds, the variables \texttt{cond}, \texttt{target}, and \texttt{child} will be bound in the scope.
The right-half succeeds if \textsc{det} returns true on the value bound to \texttt{target}.
$$\textbf{Filter}(\texttt{cond}, \textbf{Project}(\texttt{tgt}, \texttt{child})) \wedge \textsc{det}(\texttt{tgt})$$
If the left-half of the conjunction succeeds, the variables \texttt{cond}, \texttt{tgt}, and \texttt{child} will be bound in the scope.
The right-half succeeds if \textsc{det} returns true on the value bound to \texttt{tgt}.
\label{eg:MatchPattern}
\end{example}