Still working on Sec 5

master
Atri Rudra 2020-12-17 00:02:07 -05:00
parent e13373cf21
commit f63cf9c2e5
3 changed files with 112 additions and 6 deletions

View File

@ -481,14 +481,15 @@ The runtime of the algorithm is dominated by~\Cref{alg:mon-sam-onepass} (which b
The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as follows (where $\etree_\lchild$ and $\etree_\rchild$ are the `left' and `right' children of $\etree$ if they exist):
\begin{align*}
\begin{align}
\label{eq:T-all-ones}
\abs{\etree}(1,\ldots, 1) = \begin{cases}
\abs{\etree_\lchild}(1,\ldots, 1) \cdot \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = \times\\
\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = + \\
|\etree.\val| &\textbf{if }\etree.\type = \tnum\\
1 &\textbf{if }\etree.\type = \var.
\end{cases}
\end{align*}
\end{align}
%\begin{align*}
%&\eval{\etree ~|~ \etree.\type = +}_{\abs{\etree}} =&& \eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}\\
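For intuition, the recursion in~\cref{eq:T-all-ones} could be realized as in the minimal Python sketch below; the node class and its field names (type, val, left, right) are our own assumptions, not the paper's pseudocode.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    type: str                      # "times", "plus", "num", or "var"
    val: Optional[int] = None      # constant value when type == "num"
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def abs_all_ones(t: Node) -> int:
    # Evaluate |T|(1,...,1): multiply at "times" nodes, add at "plus" nodes,
    # take |value| at constants, and return 1 at variables.
    if t.type == "times":
        return abs_all_ones(t.left) * abs_all_ones(t.right)
    if t.type == "plus":
        return abs_all_ones(t.left) + abs_all_ones(t.right)
    if t.type == "num":
        return abs(t.val)
    return 1  # t.type == "var"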
@ -499,11 +500,12 @@ The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as fol
%In the same fashion the weighted distribution can be described as above with the following modification for the case when $\etree.\type = +$:
It turns out that for the proof of~\Cref{lem:sample}, we need to argue that when $\etree.\type = +$, we indeed have
\begin{align*}
\begin{align}
\label{eq:T-weights}
%&\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1); &\textbf{if }\etree.\type = + \\
\etree_\lchild.\vari{weight} &\gets \frac{\abs{\etree_\lchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1)};\\
\etree_\rchild.\vari{weight} &\gets \frac{\abs{\etree_\rchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1)+ \abs{\etree_\rchild}(1,\ldots, 1)}
\end{align*}
\end{align}
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
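Analogously, the weight assignment in~\cref{eq:T-weights} at a $+$ node amounts to the following sketch (continuing the Node/abs_all_ones sketch after~\cref{eq:T-all-ones}; the weight field is again an assumed name):

def set_plus_weights(t: Node) -> None:
    # Each child of a "+" node is weighted by its share of the two all-ones values.
    l = abs_all_ones(t.left)
    r = abs_all_ones(t.right)
    t.left.weight = l / (l + r)
    t.right.weight = r / (l + r)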

View File

@ -19,3 +19,87 @@
biburl = {https://dblp.org/rec/conf/icalp/KopelowitzW20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@book{arith-complexity,
author = {Peter B{\"{u}}rgisser and
Michael Clausen and
Mohammad Amin Shokrollahi},
title = {Algebraic complexity theory},
series = {Grundlehren der mathematischen Wissenschaften},
volume = {315},
publisher = {Springer},
year = {1997},
isbn = {3-540-60582-7},
timestamp = {Thu, 31 Jan 2013 18:02:56 +0100},
biburl = {https://dblp.org/rec/books/daglib/0090316.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{NPRR,
author = {Hung Q. Ngo and
Ely Porat and
Christopher R{\'{e}} and
Atri Rudra},
title = {Worst-case Optimal Join Algorithms},
journal = {J. {ACM}},
volume = {65},
number = {3},
pages = {16:1--16:40},
year = {2018},
url = {https://doi.org/10.1145/3180143},
doi = {10.1145/3180143},
timestamp = {Wed, 21 Nov 2018 12:44:29 +0100},
biburl = {https://dblp.org/rec/journals/jacm/NgoPRR18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{skew,
author = {Hung Q. Ngo and
Christopher R{\'{e}} and
Atri Rudra},
title = {Skew strikes back: new developments in the theory of join algorithms},
journal = {{SIGMOD} Rec.},
volume = {42},
number = {4},
pages = {5--16},
year = {2013},
url = {https://doi.org/10.1145/2590989.2590991},
doi = {10.1145/2590989.2590991},
timestamp = {Fri, 06 Mar 2020 21:55:55 +0100},
biburl = {https://dblp.org/rec/journals/sigmod/NgoRR13.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{factorized-db,
author = {Dan Olteanu and
Maximilian Schleich},
title = {Factorized Databases},
journal = {{SIGMOD} Rec.},
volume = {45},
number = {2},
pages = {5--16},
year = {2016},
url = {https://doi.org/10.1145/3003665.3003667},
doi = {10.1145/3003665.3003667},
timestamp = {Fri, 06 Mar 2020 21:56:19 +0100},
biburl = {https://dblp.org/rec/journals/sigmod/OlteanuS16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{ngo-survey,
author = {Hung Q. Ngo},
editor = {Jan Van den Bussche and
Marcelo Arenas},
title = {Worst-Case Optimal Join Algorithms: Techniques, Results, and Open
Problems},
booktitle = {Proceedings of the 37th {ACM} {SIGMOD-SIGACT-SIGAI} Symposium on Principles
of Database Systems, Houston, TX, USA, June 10-15, 2018},
pages = {111--124},
publisher = {{ACM}},
year = {2018},
url = {https://doi.org/10.1145/3196959.3196990},
doi = {10.1145/3196959.3196990},
timestamp = {Wed, 21 Nov 2018 12:44:18 +0100},
biburl = {https://dblp.org/rec/conf/pods/000118.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

View File

@ -5,6 +5,24 @@ In this section, we consider couple of generalizations/corollaries of our resul
\subsection{Lineage circuits}
\label{sec:circuits}
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials, and indeed much of the rest of the paper has focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees}, we considered these polynomials to be represented as expression trees. However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and for other query processing methods as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, which are a standard way to represent polynomials over fields (and are standard in the field of algebraic complexity), though in our case we use them for polynomials over $\mathbb N$ in the obvious way. We defer the formal treatment of {\em lineage circuits} to~\Cref{sec:circuits-formal} and present a quick overview here. A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks correspond to the output. Every other node has at most two incoming edges (and is labeled as either an addition or a multiplication node), but there is no limit on the outdegree of such nodes. We note that if we restricted the outdegree to be one, then we would get back expression trees.
In~\Cref{sec:results-circuits} we argue why our results from earlier sections also hold for lineage circuits (which we formally define in~\Cref{sec:circuits-formal}), and then argue why lineage circuits do indeed capture the notion of runtime of some well-known query processing algorithms in~\Cref{sec:circuit-runtime} (we formally define the cost model that captures the runtime of these algorithms in~\Cref{sec:cost-model}).
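To make the sharing aspect concrete, the following is a minimal Python sketch (all names are ours, not the paper's) of a lineage-circuit gate: a DAG node with at most two inputs but unbounded fan-out, so a sub-circuit can be referenced several times instead of being copied as it would be in an expression tree.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(eq=False)
class Gate:
    label: str                       # "+", "*", "var", or "const"
    val: Optional[object] = None     # variable name or constant value (sources only)
    inputs: Tuple["Gate", ...] = ()  # empty for sources, two incoming nodes otherwise

# Sharing example: (x + y) * (x + y) stores the sum gate once; as an expression
# tree the subtree for (x + y) would have to be duplicated.
x, y = Gate("var", "x"), Gate("var", "y")
s = Gate("+", inputs=(x, y))
root = Gate("*", inputs=(s, s))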
\subsubsection{Extending our results to lineage circuits}
\label{sec:results-circuits}
We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for lineage circuits.
For the approximation algorithm in~\Cref{sec:algo}, we note that $\approxq$ (\Cref{alg:mon-sam}) works for lineage circuits as long as the guarantees of $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and~\Cref{lem:onepass}, respectively) also hold for lineage circuits. It turns out that both $\onepass$ and $\sampmon$ do work for lineage circuits, simply because the only property of expression trees they use is that each node has two children, and this is still valid for lineage circuits (where for each non-source node the children correspond to the two nodes that have incoming edges to the given node). Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
More specifically, consider $\onepass$. The algorithm (as well as its analysis) basically uses the fact that one can compute the corresponding polynomial at the all-$1$s input with a simple recursive formula (\cref{eq:T-all-ones}) and that we can compute a probability distribution based on these weights (as in~\cref{eq:T-weights}). It can be verified that all the arguments go through if we replace $\etree_\lchild$ and $\etree_\rchild$ for expression tree $\etree$ with the two incoming nodes of the sink for the given lineage circuit.
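To illustrate, a memoized traversal of the DAG suffices: the sketch below (assumed names, reusing the Gate class from the overview) computes the all-ones value as in~\cref{eq:T-all-ones} and attaches the weights of~\cref{eq:T-weights} to every addition gate, visiting each gate once even when it has many parents.

def one_pass(root: Gate) -> int:
    memo = {}  # gate id -> all-ones value, so shared gates are evaluated once
    def value(g: Gate) -> int:
        if id(g) in memo:
            return memo[id(g)]
        if g.label == "const":
            v = abs(g.val)
        elif g.label == "var":
            v = 1
        elif g.label == "*":
            v = value(g.inputs[0]) * value(g.inputs[1])
        else:  # "+": record the children's sampling weights as in the tree case
            l, r = value(g.inputs[0]), value(g.inputs[1])
            g.weights = (l / (l + r), r / (l + r))
            v = l + r
        memo[id(g)] = v
        return v
    return value(root)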
\subsubsection{The cost model}
\label{sec:cost-model}
Thus far, our analysis of the runtime of $\onepass$ has been in terms of the size of the compressed lineage polynomial.
We now show that this models the behavior of a deterministic database by proving that, for any Boolean conjunctive query, we can construct a compressed lineage polynomial with the same complexity as it would take to evaluate the query on a deterministic \emph{bag-relational} database.
We adopt a minimalistic model of query evaluation focusing on the size of intermediate materialized states.
@ -25,7 +43,8 @@ The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
\end{proposition}
\subsection{Circuit Lineage}
\subsubsection{Lineage circuit for query plans}
\label{sec:circuits-formal}
We represent lineage polynomials with arithmetic circuits over $\mathbb N$ with $+$, $\times$.
A circuit for relation $R$ is an acyclic graph $\tuple{V_R, E_R, \phi_R, \ell_R}$ with vertices $V_R$ and directed edges $E_R \subset V_R^2$.
A sink function $\phi_R : R \rightarrow V_R$ maps the tuples of the relation to vertices in the graph.
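For instance, for a base relation one plausible instantiation (a hypothetical sketch with our own names, reusing the Gate class from the overview) is one source gate per tuple, with the sink function returning that tuple's gate:

R = [("a", 1), ("b", 2)]                                       # a toy relation
gate_of = {t: Gate("var", f"X_{i}") for i, t in enumerate(R)}  # one source gate per tuple
def phi_R(t):
    # sink function: maps a tuple of R to its vertex in the circuit
    return gate_of[t]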
@ -94,7 +113,8 @@ Let $\ell_Q(v_t) = \times$, and let $\phi_Q(t) = v_t$
As in projection, newly created vertices will have an in-degree of $n$, and a fan-in tree is required.
There are $|{Q_1} \bowtie \ldots \bowtie {Q_n}|$ such vertices, so the corrected circuit has $|V_{Q_1}|+\ldots+|V_{Q_n}|+(n-1)|{Q_1} \bowtie \ldots \bowtie {Q_n}|$ vertices.
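The $(n-1)$ factor comes from the fact that a binary fan-in tree over $n$ inputs uses exactly $n-1$ binary gates; a small sketch (assumed names, reusing the Gate class):

def fan_in(label: str, inputs):
    # Replace an n-ary gate by a chain of n - 1 binary gates over its inputs.
    node = inputs[0]
    for g in inputs[1:]:
        node = Gate(label, inputs=(node, g))
    return node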
\subsection{Runtime vs Lineage}
\subsubsection{Circuit size vs. runtime}
\label{sec:circuit-runtime}
\begin{lemma}
\label{lem:circuits-model-runtime}