%!TEX root=../main.tex

\begin{figure}
\includegraphics[width=0.9\columnwidth]{data/fall_2022_spark3.2/plotting/output/stackedGraph-2023-07-17__17_09_17.pdf}
\caption{Work Breakdown for the Spark 3.2 Optimizer}
\label{fig:sparkBreakdown}
\end{figure}

\section{Introduction}
\label{sec:introduction}

Compilers and program analyses grapple with exceedingly large state spaces, which has historically required that high-level analysis tasks be broken into small, manageable chunks
(e.g., separate compilation and context-insensitive program analyses).
Unfortunately, such a granular methodology often prevents the idiomatic expression of useful analyses, and the resulting loss of precision may preclude useful optimizations.

We propose a new relational-algebra-based language (called \systemlang) that
reframes common compilation and analysis tasks as database
operations. \systemlang unifies existing database optimizations (e.g.,
work sharing from streaming systems) with existing compiler tricks
(e.g., Tree Toasting~\cite{balakrishnan:2021:sigmod:treetoaster}),
laying the groundwork for creating a truly scalable, ``declarative'' compiler
by leveraging the wide array of scalable data processing techniques developed by the database community.

\paragraph{Production Rules}
Compiler transformations and optimizations are often expressed in
terms of production rules: if a pattern matching a certain form is
found, it is transformed according to the production rule. For
example, the classic selection push-down rule common in query plan optimization
may be expressed as:

$$\production{\sigma_\theta(\pi_{A}(R))}{\pi_{A}(\sigma_\theta(R))}$$

In other words, any relational algebra expression
$\sigma_{\theta}(\pi_{A}(R))$ may be safely replaced by an equivalent
expression of the form $\pi_{A}(\sigma_{\theta}(R))$. Expressions are
often realized as trees, and thus this style of optimization may be
viewed as a tree transformation.
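As a concrete instance (the attribute names $a$ and $b$ and the constant $5$ are illustrative assumptions, not drawn from any workload), taking $\theta \equiv (a > 5)$ and $A = \{a, b\}$ gives:

$$\production{\sigma_{a > 5}(\pi_{a,b}(R))}{\pi_{a,b}(\sigma_{a > 5}(R))}$$

The rewrite is safe in this instance because $\theta$ refers only to attributes in $A$, all of which are also attributes of $R$.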

\paragraph{Match Patterns}

Production rules are often implemented via `match patterns' in
functional languages, or via analogous constructs in imperative languages~\cite{DBLP:conf/sigmod/SolimanAREGSCGRPWNKB14}.
Match patterns are ubiquitous in functional languages due to their utility for programming with algebraic data types~\cite{Luc:2008}.
For example, the push-down rule above could be expressed in Scala using a \texttt{match} expression:

\begin{lstlisting}
plan match {
  case Filter(cond, Project(tgt, child)) =>
    Project(tgt, Filter(cond, child))
}
\end{lstlisting}

The pattern checks whether \lstinline{plan} is a \lstinline{Filter} ($\sigma$) node.
If so, it binds the node's first field to \lstinline{cond} and checks whether its second field is a \lstinline{Project} ($\pi$) node.
If so, it binds that node's two fields to \lstinline{tgt} and \lstinline{child}, and runs the code to the right of \lstinline{=>} to build a replacement tree.

A key design methodology for optimizers of relational query languages (and other paradigms) is to execute a large set of production rules simultaneously, iteratively inferring better join plans until a fixed point is reached (or time dictates that we must give up). The key insights driving \systemlang are that
(i) these match patterns are effectively queries over the relational algebra tree, and
(ii) reframing them as such allows us to employ classical database optimizations to create more scalable compilers.
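To make the rule-application loop concrete, the following is a minimal, self-contained sketch that applies the push-down rule until a fixed point is reached. The \lstinline{Plan} classes, the \lstinline{rewriteOnce} and \lstinline{fixpoint} helpers, and the iteration cap are hypothetical simplifications introduced for illustration; they are neither Catalyst's classes nor the implementation of \systemlang.

\begin{lstlisting}
// Hypothetical, simplified plan tree (NOT Catalyst's actual classes).
sealed trait Plan
case class Table(name: String) extends Plan
case class Project(tgt: Seq[String], child: Plan) extends Plan
case class Filter(cond: String, child: Plan) extends Plan

object PushDownSketch {
  type Rule = PartialFunction[Plan, Plan]

  // The selection push-down rule from the text.
  val pushDown: Rule = {
    case Filter(cond, Project(tgt, child)) =>
      Project(tgt, Filter(cond, child))
  }

  // One bottom-up pass: rewrite the children first,
  // then try each rule once at the current node.
  def rewriteOnce(rules: Seq[Rule])(plan: Plan): Plan = {
    val rebuilt = plan match {
      case Project(tgt, child) => Project(tgt, rewriteOnce(rules)(child))
      case Filter(cond, child) => Filter(cond, rewriteOnce(rules)(child))
      case leaf                => leaf
    }
    rules.foldLeft(rebuilt) { (p, rule) =>
      rule.applyOrElse(p, identity[Plan])
    }
  }

  // Repeat passes until a pass changes nothing (the fixed point),
  // or until the assumed iteration cap runs out.
  @annotation.tailrec
  def fixpoint(rules: Seq[Rule], maxIters: Int = 100)(plan: Plan): Plan = {
    val next = rewriteOnce(rules)(plan)
    if (next == plan || maxIters <= 1) next
    else fixpoint(rules, maxIters - 1)(next)
  }
}
\end{lstlisting}

Applied to a filter sitting above two nested projections, this sketch needs two passes to push the filter down to the base table and a third pass to detect that nothing changes. Every pass re-traverses the whole tree, which is the kind of repeated search that a query-based formulation aims to avoid.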

\paragraph{Tree Toasting}
This paper builds on our prior work~\cite{balakrishnan:2021:sigmod:treetoaster}, in which we developed a compilation technique called Tree Toasting.
A toasted optimizer pre-computes the set of subtrees that match a rewrite rule's predicate.
As the optimizer rewrites segments of the tree, these pre-computed sets (i.e., `materialized views') are maintained.

\paragraph{Work Sharing}
In this paper, we focus on an orthogonal optimization strategy: Work Sharing~\cite{DBLP:journals/ieeecc/KremienKM93} from stream processing.
When two queries with a common sub-plan are registered with a stream processor, the common sub-plan is executed only once.
Similarly, we explore a work sharing optimization in which pattern-matching predicates common to multiple rules are merged.
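The idea can be illustrated with a small sketch. Everything in it is a hypothetical simplification introduced for illustration: the \lstinline{Node} type, the string-keyed \lstinline{Pred} class, the rule names, and the evaluation counter are assumptions, and per-node memoization stands in for whatever merging strategy the actual runtime uses. Each rule is a conjunction of named predicates, and a predicate shared by several rules is evaluated at most once per node.

\begin{lstlisting}
import scala.collection.mutable

// Hypothetical node: an operator label plus its children.
case class Node(op: String, children: List[Node] = Nil)
// Predicates with equal keys are treated as the same predicate.
case class Pred(key: String, test: Node => Boolean)
// A rule matches a node when all of its predicates hold.
case class MatchRule(name: String, preds: List[Pred])

object WorkSharingSketch {
  val isFilter = Pred("root=Filter", _.op == "Filter")
  val childIsProject = Pred("child=Project",
    _.children.headOption.exists(_.op == "Project"))
  val childIsFilter = Pred("child=Filter",
    _.children.headOption.exists(_.op == "Filter"))

  // Two rules that share the `isFilter` predicate.
  val rules = List(
    MatchRule("pushDownFilter", List(isFilter, childIsProject)),
    MatchRule("mergeFilters", List(isFilter, childIsFilter))
  )

  // Naive: every rule evaluates its own predicates.
  // Returns (names of matching rules, number of evaluations).
  def matchNaive(node: Node): (List[String], Int) = {
    var evals = 0
    val matched = rules.filter { r =>
      r.preds.forall { p => evals += 1; p.test(node) }
    }.map(_.name)
    (matched, evals)
  }

  // Shared: each distinct predicate is evaluated once per node;
  // later rules reuse the cached result.
  def matchShared(node: Node): (List[String], Int) = {
    var evals = 0
    val cache = mutable.Map.empty[String, Boolean]
    def check(p: Pred): Boolean =
      cache.getOrElseUpdate(p.key, { evals += 1; p.test(node) })
    val matched = rules.filter(_.preds.forall(check)).map(_.name)
    (matched, evals)
  }
}
\end{lstlisting}

On a \lstinline{Filter} node whose child is a \lstinline{Project}, both variants report the same single match, but the naive variant performs four predicate evaluations while the shared variant performs three. The gap grows with the number of rules that share a prefix of predicates.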

\paragraph{\systemlang}
Although we focus on one specific optimization in this paper, we emphasize that compiling pattern matching down to a query language opens up a range of further optimization opportunities, including
(i) cost-based optimization of evaluation strategies,
(ii) parallelization for exploring large optimization search spaces, and
(iii) differential dataflow~\cite{DBLP:conf/cidr/McSherryMII13} for incremental `live' compilation.

\paragraph{Case Study: Apache Spark}
We explore \systemlang in the context of Apache Spark's Catalyst query optimizer.
\Cref{fig:sparkBreakdown} breaks down how Catalyst spends its time while optimizing the 22 queries of the TPC-H workload\footnote{A similar figure appears in \cite{balakrishnan:2021:sigmod:treetoaster}. \Cref{fig:sparkBreakdown} has been updated to account for improvements to the optimizer in Spark version 3.2.}.
At least a quarter of its time is spent iterating over trees (`Search'), and a further quarter is spent on bookkeeping (`Fixpoint Loop').
Both are strong candidates for database-style optimizations.

For this paper, we translated a significant fragment of the Catalyst optimizer (\textcolor{red}{[TODO]} rules in total) into ASTral-compatible match syntax\footnote{\url{https://git.odin.cse.buffalo.edu/Astral/astral-compiler/src/branch/main/astral/catalyst/src/com/astraldb/catalyst/Catalyst.scala}}.
We use this fragment to evaluate our optimizations on a variety of queries, including (i) the 22 queries of the TPC-H benchmark, and
(ii) several large queries generated by deployments of Vizier~\cite{brachmann:2020:cidr:your}, a computational notebook based on Spark.

\subsection{Contributions}

In this paper, we make the following contributions:
(i) We introduce \systemlang, a declarative language for building compilers, in \Cref{sec:datamodel};
(ii) We show how match patterns, a common format for implementing rewrite rules, can be compiled to \systemlang;
(iii) We develop a runtime for \systemlang based on work sharing in stream processing systems in \Cref{sec:queryEvaluation};
(iv) We adapt Tree Toasting to \systemlang in \Cref{sec:treetoasting};
(v) We evaluate \systemlang by re-implementing a fragment of Spark's Catalyst optimizer in \Cref{sec:experiments}; and
(vi) We explore potential further ways to leverage the declarative nature of \systemlang in \Cref{sec:conclusions}.