appendix D

Boris Glavic 2020-12-20 17:50:56 -06:00
parent 5526ba5334
commit cc7c5fdb8a


@ -1,7 +1,8 @@
%!TEX root=./main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}
To justify the use of $\semNX$-databases, we need to show that we can encode any $\semN$-PDB in this way and that the query semantics over this representation coincides with query semantics over $\semN$-PDB. For that it will be opportune to define representation systems for $\semN$-PDBs.\BG{cite}
@ -46,10 +47,10 @@ Importantly, as the following proposition shows, any finite $\semN$-PDB can be e
$\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is closed under $\raPlus$ queries.
\end{Proposition}
\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
To prove that $\semNX$-PDBs are complete, consider the following construction, which for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\db, \pd')$ such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $\max(D_i)$ denote $\max_{\tup} D_i(\tup)$. For each world $D_i$, we create a corresponding variable $X_i$.
%variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$.
In $\db$ we assign each tuple $\tup$ the polynomial:
%
\[
@ -79,7 +80,7 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
We need to prove for $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$ that $\expct_{\db \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}$.
By expanding $\polyForTuple$ and the expectation we have:
\begin{align*}
\expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}
& = \sum_{\vct{w} \in \{0,1\}^n}\probOf'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
\intertext{From $\rmod(\pxdb) = \pdb$, we have that the range of $\assign_{\vct{w}}(\pxdb)$ is $\idb$, so}
& = \sum_{\db \in \idb}\;\;\sum_{\vct{w} \in \{0,1\}^n : \assign_{\vct{w}}(\pxdb) = \db}\probOf'(\vct{w}) \cdot Q(\pxdb)(t)(\vct{w})\\
@ -91,7 +92,7 @@ By expanding $\polyForTuple$ and the expectation we have:
\subsection{Supplementary Material for~\Cref{subsec:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob_\tup$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist in the world and of one minus the probability of each tuple of $\db$ that is not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilities of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
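As a small illustration of the set-semantics definition (a sketch with hypothetical tuples $\tup_1, \tup_2$ and probabilities $\prob_1, \prob_2$ introduced only for this example), consider a \ti containing exactly these two tuples. Its four possible worlds have probabilities
\begin{align*}
\probOf(\emptyset) &= (1-\prob_1)(1-\prob_2), & \probOf(\{\tup_1\}) &= \prob_1(1-\prob_2),\\
\probOf(\{\tup_2\}) &= (1-\prob_1)\prob_2, & \probOf(\{\tup_1,\tup_2\}) &= \prob_1\prob_2,
\end{align*}
which sum to $1$, as expected when tuples are independent events.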
\AH{This part \emph{below} needs more work if we include it.}
@ -125,7 +126,7 @@ Follows by the construction of $\rpoly$ in \cref{def:reduced-bi-poly}. \qed
\noindent Note the following fact:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}\label{proposition:q-qtilde} For any \bi-lineage polynomial $\poly(X_1, \ldots, X_\numvar)$ and all $\vct{w} \in \eta$, it holds that
$% \[
\poly(\vct{w}) = \rpoly(\vct{w}).
$% \]
@ -196,7 +197,7 @@ For the set of edges in $\graph{\ell}$ we write $E_\ell$. For any graph $\graph
Given an arbitrary subgraph $S\graph{1}$ of $\graph{1}$, let $\eset{1}$ denote the set of edges in $S\graph{1}$. Define then $\eset{\ell}$ for $\ell > 1$ as the set of edges in the generated subgraph $S\graph{\ell}$ (i.e., when we apply~\Cref{def:Gk} to $\graph{1}=S\graph{1}$).
\end{Definition}
For example, consider $S\graph{1}$ with edges $\eset{1} = \{e_1\}$. Then the edges of $S\graph{2}$ are defined as $\eset{2} = \{(e_1, 0), (e_1, 1)\}$.
\begin{Definition}\label{def:ed-sub}
Let $\binom{S}{t}$ denote the set of subsets in $S$ with exactly $t$ edges. In a similar manner, $\binom{S}{\leq t}$ is used to mean the subsets of $S$ with $t$ or fewer edges.
\end{Definition}
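For instance, taking the hypothetical edge set $S = \{e_1, e_2, e_3\}$ purely as an illustration, the definition gives
\[
\binom{S}{2} = \pbrace{\{e_1, e_2\},\; \{e_1, e_3\},\; \{e_2, e_3\}},
\]
so $\binom{S}{2}$ has $3$ elements. Assuming the empty subset is counted, $\binom{S}{\leq 2}$ additionally contains $\emptyset$ and the three singletons, for $1 + 3 + 3 = 7$ subsets in total.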
@ -220,7 +221,7 @@ $f_\ell$ is a function.
\end{Lemma}
\subsubsection{Proof of~\cref{lem:fk-func}}\label{subsubsec:proof-fk}%[Proof of Lemma \ref{lem:fk-func}]
Note that $f_\ell$ is properly defined. For any $S \in \binom{E_\ell}{3}$, $|f_\ell(S)| \leq 3$, since any subset of $3$ edges in $E_\ell$ maps to at most three edges in $E_1$, so all images are in the required range. Moreover, for any $b \in \{0,\ldots, \ell-1\}$ the map $(e, b) \mapsto e$ is a function, which %` mapping for which $(e, b)$ maps to no other edge than $e$, and this
implies that $f_\ell$ is a function.
We are now ready to prove the structural lemmas. Note that $f_\ell$ maps subsets of three edges in $\graph{\ell}$ to a subset of at most three edges in $E_1$. To prove the structural lemmas, we will use the map $f_\ell^{-1}$. In particular, to count the number of occurrences of $\tri,\threepath,\threedis$ in $\graph{\ell}$ we count for each $S\in\binom{E_1}{\le 3}$, how many of $\tri/\threepath/\threedis$ subgraphs appear in $f_\ell^{-1}(S)$.
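To illustrate both directions (a sketch, assuming as in the proof above that $f_\ell$ acts on a set by applying $(e,b) \mapsto e$ to each of its elements), take $\ell = 2$ and two edges $e_1, e_2 \in E_1$:
\[
f_2\inparen{\pbrace{(e_1, 0), (e_1, 1), (e_2, 0)}} = \{e_1, e_2\}.
\]
For the preimage with $\ell = 3$ and a single edge $e_1$, exactly one $3$-edge subset of $E_3$ maps to $\{e_1\}$:
\[
f_3^{-1}(\{e_1\}) = \pbrace{\pbrace{(e_1,0),(e_1,1),(e_1,2)}},
\]
whereas $f_2^{-1}(\{e_1\})$ is empty, because $e_1$ has only two copies in $E_2$.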
@ -228,7 +229,7 @@ We are now ready to prove the structural lemmas. Note that $f_\ell$ maps subsets
\subsubsection{Proof of Lemma \ref{lem:3m-G2}}
For each subset $\eset{1}\in \binom{E_1}{\le 3}$, we count the number of $3$-matchings in the $3$-edge subgraphs of $\graph{2}$ in $f_2^{-1}(\eset{1})$. We first consider the case of $\eset{1} \in \binom{E_1}{3}$, where $\eset{1}$ is composed of the edges $e_1, e_2, e_3$ and $f_2^{-1}(\eset{1})$ is the set of all $3$-edge subsets of the set
\begin{equation*}
\{(e_1, 0), (e_1, 1), (e_2, 0), (e_2, 1), (e_3, 0), (e_3, 1)\}.
\end{equation*}
@ -242,7 +243,7 @@ When $\eset{1} \equiv \threedis$, that edges in $\eset{2}$ are {\em not} disjoi
\begin{itemize}
\item Disjoint Two-Path ($\twopathdis$)
\end{itemize}
For $\eset{1} \equiv \twopathdis$, edges $e_2, e_3$ form a $2$-path with $e_1$ being disjoint. This means that $(e_2, 0), (e_2, 1), (e_3, 0), (e_3, 1)$ form a $4$-path while $(e_1, 0), (e_1, 1)$ is its own disjoint $2$-path. We can only pick either $(e_1, 0)$ or $(e_1, 1)$ for $f_2^{-1}(\eset{1})$, and then we need to pick a $2$-matching from $e_2$ and $e_3$. Note that the $4$-path admits exactly three $2$-matchings, specifically,
\begin{equation*}
\pbrace{(e_2, 0), (e_3, 0)}, \pbrace{(e_2, 0), (e_3, 1)}, \pbrace{(e_2, 1), (e_3, 1)}.
\end{equation*}
@ -251,12 +252,12 @@ Since these two selections can be made independently, there are $2 \cdot 3 = 6$
\begin{itemize}
\item $3$-star ($\oneint$)
\end{itemize}
When $\eset{1} \equiv \oneint$, the inner edges $(e_i, 1)$ of $\eset{2}$ are all connected, and the outer edges $(e_i, 0)$ are all disjoint. Note that a valid $3$-matching can contain at most one inner edge. When exactly one inner edge is chosen, there are $3$ such possibilities. The remaining $3$-matching occurs when all $3$ outer edges are chosen. Thus, there are $3 + 1 = 4$ many $3$-matchings in $f_2^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-path ($\threepath$)
\end{itemize}
When $\eset{1} \equiv\threepath$ it is the case that all edges beginning with $e_1$ and ending with $e_3$ are successively connected. This means that the edges of $\eset{2}$ form a $6$-path in the edges of $f_2^{-1}(\eset{1})$, where all edges from $(e_1, 0),\ldots,(e_3, 1)$ are successively connected. For a $3$-matching to exist in $f_2^{-1}(\eset{1})$, we cannot pick both $(e_i,0)$ and $(e_i,1)$. % there must be at least one edge separating edges picked from a sequence.
There are four such possibilities: $\pbrace{(e_1, 0), (e_2, 0), (e_3, 0)}$, $\pbrace{(e_1, 0), (e_2, 0), (e_3, 1)}$, $\pbrace{(e_1, 0), (e_2, 1), (e_3, 1)}$, $\pbrace{(e_1, 1), (e_2, 1), (e_3, 1)}$. Thus, there are four possible $3$-matchings in $f_2^{-1}(\eset{1})$.
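As a cross-check of this count (a sketch by inclusion-exclusion, not part of the original argument): a $3$-matching must use exactly one copy of each of $e_1, e_2, e_3$, which gives $2^3 = 8$ candidate triples. On the $6$-path, the only adjacent pairs drawn from different original edges are $\pbrace{(e_1,1),(e_2,0)}$ and $\pbrace{(e_2,1),(e_3,0)}$. No triple contains both pairs, since each triple uses a single copy of $e_2$, and each pair rules out $2$ triples. Hence the number of $3$-matchings is
\[
2^3 - 2 - 2 = 4.
\]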
\begin{itemize}
@ -294,7 +295,7 @@ When $\eset{1} \equiv \twopath$ and now we have all edges in $\eset{3}$ form a $
\end{itemize}
For $\eset{1} \equiv \twodis$, all edges of $\eset{3}$ are predicated on the fact that $(e_i, b)$ is disjoint with $(e_j, b)$ for $i \neq j\in \{1,2\}$ and $b \in \{0, 1, 2\}$. Pick an arbitrary $e_i$ and note that $(e_i, 0), (e_i, 2)$ is a $2$-matching, which can combine with any of the $3$ edges in $(e_j, 0), (e_j, 1), (e_j, 2)$, again for $i \neq j$. Since the selections are independent, it follows that there exist $2 \cdot 3 = 6$ many $3$-matchings in $f_3^{-1}(\eset{1})$.
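Spelled out (an explicit enumeration added for illustration), the six $3$-matchings are
\[
\pbrace{(e_1,0),(e_1,2),(e_2,b)} \quad\text{and}\quad \pbrace{(e_2,0),(e_2,2),(e_1,b)}, \qquad b \in \{0,1,2\}.
\]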
Now, we consider the 3-edge subgraphs of $\graph{1}$, starting with $\eset{1} = \tri$.
\begin{itemize}
\item Triangle ($\tri$)
\end{itemize}
@ -328,7 +329,7 @@ for a total of 18 many 3-matchings in $f_3^{-1}(\eset{1})$.
\begin{itemize}
\item $3$-path ($\threepath$)
\end{itemize}
When $\eset{1} \equiv \threepath$, all edges in $\eset{3}$ are successively connected to form a $9$-path. Since $(e_1, 0)$ is disjoint from $(e_3, 2)$, both of these edges can exist in a $3$-matching. This relaxation yields $3$ other $3$-matchings that could not be counted in the case of $\eset{1} = \tri$, namely
\begin{equation*}
\pbrace{(e_1, 0), (e_2, 0), (e_3, 2)},\pbrace{(e_1, 0), (e_2, 1), (e_3, 2)}, \pbrace{(e_1, 0), (e_2, 2), (e_3, 2)}.
\end{equation*}
@ -337,7 +338,7 @@ There are therefore $18 + 3 = 21$ $3$-matchings in $f_3^{-1}(\eset{1})$.
\begin{itemize}
\item Disjoint Two-Path ($\twopathdis$)
\end{itemize}
Assume $\eset{1} = \twopathdis$. Then the edges of $\eset{3}$ have successive connectivity from $(e_1, 0)$ through $(e_1, 2)$, and successive connectivity from $(e_2, 0)$ through $(e_3, 2)$; that is, the edges in $\eset{3}$ form a $6$-path with a disjoint $3$-path. There exist $8$ distinct $2$-matchings (with at least one $(e_2,\cdot)$ and at least one $(e_3,\cdot)$ edge) in the $6$-path $(e_2, 0),\ldots, (e_3, 2)$ of the form
\begin{equation*}
\pbrace{(e_2, 0), (e_3, 0)},\ldots, \pbrace{(e_2, 1), (e_3, 2)}, \pbrace{(e_2, 2), (e_3, 1)}, \pbrace{(e_2, 2), (e_3, 2)}.
\end{equation*}
@ -363,7 +364,7 @@ All of the observations above focused only on the shape of $\eset{1}$, and since
\subsubsection{Proof of~\cref{lem:3p-G3}}
The argument follows along the same lines as in the proof of \cref{lem:3p-G2}. Given $\mathcal{P} \in f_3^{-1}\inparen{\eset{1}}$, it \textit{must} be that every edge in $f_3(\mathcal{P})$ has at least one edge in $\mathcal{P}$ mapped to it (and $\mathcal{P}$ is connected). Notice again that this cannot be the case for any $\eset{1} \in \binom{E_1}{3}$, nor is it the case when $\eset{1} = \twodis$. This leaves us with two patterns, $\eset{1} = \twopath$ and $\eset{1} = \ed$. For the former, we have two $3$-paths across $e_1$ and $e_2$, namely $\pbrace{(e_1, 1), (e_1, 2), (e_2, 0)}$ and $\pbrace{(e_1, 2), (e_2, 0), (e_2, 1)}$. For the latter pattern $\ed$, it is trivial to see that an edge in $\graph{1}$ becomes a $3$-path in $\graph{3}$, and this proves the identity.
All of the observations above focused only on the shape of $\eset{1}$, and since we see that for fixed $\eset{1}$, we have a fixed number of $3$-paths, this implies the identity.
@ -396,7 +397,7 @@ We now return to the proof of~\Cref{lem:mon-samp}:
\subsection{Proof of Theorem \ref{lem:mon-samp}}\label{app:subsec-th-mon-samp}
Consider now the random variables $\randvar_1,\dots,\randvar_\numvar$, where each $\randvar_i$ is the value of $\vari{Y}_{\vari{i}}$ after~\Cref{alg:mon-sam-product} is executed. In particular, note that we have
\[Y_i= \onesymbol\inparen{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i,\]
where the indicator variable handles the check in~\Cref{alg:check-duplicate-block}.
Then for random variable $\randvar_i$, it is the case that
\begin{align*}
\expct\pbox{\randvar_i} &= \sum\limits_{(\monom, \coef) \in \expandtree{\etree} }\frac{\onesymbol\inparen{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\prod_{X_i\in \var\inparen{v}} p_i }{\abs{\etree}(1,\dots,1)} \\
@ -406,7 +407,7 @@ where in the first equality we use the fact that $\vari{sgn}_{\vari{i}}\cdot \ab
Let $\empmean = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$. It is also true that
\[\expct\pbox{\empmean} %\expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i}
= \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i}
= \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)}{\abs{\etree}(1,\ldots, 1)}.\]
@ -415,7 +416,7 @@ Hoeffding's inequality states that if we know that each $\randvar_i$ (which are
\probOf\left(\left|\empmean - \expct\pbox{\empmean}\right| \geq \error\right) \leq 2\exp{\left(-\frac{2\samplesize^2\error^2}{\sum_{i = 1}^{\samplesize}(b_i -a_i)^2}\right)}.
\end{equation*}
Since Line~\ref{alg:mon-sam-sample} shows that $\vari{sgn}_\vari{i}$ has a value in $\{-1, 1\}$ that is multiplied with $O(k)$ values $\prob_i\in [0, 1]$, the range for each $\randvar_i$ is $[-1, 1]$.
Using Hoeffding's inequality, we then get:
\begin{equation*}
\probOf\pbox{~\left| \empmean - \expct\pbox{\empmean} ~\right| \geq \error} \leq 2\exp{\left(-\frac{2\samplesize^2\error^2}{2^2 \samplesize}\right)} = 2\exp{\left(-\frac{\samplesize\error^2}{2 }\right)}\leq \conf,
@ -428,10 +429,10 @@ This concludes the proof for the first claim of theorem ~\ref{lem:mon-samp}.
The runtime of the algorithm is dominated by~\Cref{alg:mon-sam-onepass} (which by~\Cref{lem:one-pass} takes time $O(size(\etree))$) and the $\samplesize$ iterations of the loop in~\Cref{alg:sampling-loop}. Each iteration's run time is dominated by the call to~\Cref{alg:mon-sam-sample} (which by~\Cref{lem:sample} takes $O(\log{k} \cdot k \cdot depth(\etree))$) and~\Cref{alg:check-duplicate-block}, which by the subsequent argument takes $O(k\log{k})$ time. We sort the $O(k)$ variables by their block IDs and then check if there is a duplicate block ID or not. Adding up all the times discussed here gives us the desired overall runtime.
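For concreteness, the tally can be written out as follows (a sketch that assumes $depth(\etree) \geq 1$, so that the $O(k\log{k})$ duplicate check is absorbed into the sampling cost):
\[
O\inparen{size(\etree)} + \samplesize \cdot O\inparen{\log{k}\cdot k \cdot depth(\etree) + k\log{k}}
= O\inparen{size(\etree) + \samplesize \cdot k \log{k} \cdot depth(\etree)}.
\]
The number of samples comes from solving the Hoeffding bound $2\exp\inparen{-\samplesize\error^2/2} \leq \conf$ used in the first part of the proof for $\samplesize$:
\[
\samplesize \geq \frac{2\ln(2/\conf)}{\error^2}.
\]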
\subsection{Proof of~\Cref{cor:approx-algo-const-p}}
The result follows by first noting that by definition of $\gamma$, we have
%\AH{Just wondering why you use $\geq$ as opposed to $=$?}
%\AR{Ah, right-- fixed}
\[\rpoly(1,\dots,1)= (1-\gamma)\cdot \abs{\etree}(1,\dots,1).\]
Further, since each $\prob_i\ge \prob_0$ and $\poly(\vct{X})$ (and hence $\rpoly(\vct{X})$) has degree at most $k$, we have that
\[ \rpoly(\prob_1,\dots,\prob_\numvar) \ge \prob_0^k\cdot \rpoly(1,\dots,1).\]
Together, the equality and the inequality above imply $\rpoly(\prob_1,\dots,\prob_\numvar) \ge \prob_0^k\cdot (1-\gamma)\cdot \abs{\etree}(1,\dots,1)$.
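As a numeric illustration with hypothetical values (chosen only for this example), take $\prob_0 = \frac{1}{2}$, $k = 2$, and $\gamma = \frac{1}{4}$. Then
\[
\prob_0^k\cdot (1-\gamma) = \frac{1}{4}\cdot\frac{3}{4} = \frac{3}{16},
\]
so the lower bound evaluates to $\frac{3}{16}\cdot \abs{\etree}(1,\dots,1)$.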
@ -536,9 +537,9 @@ level 2/.style={sibling distance=0.7cm},
%\node[above right=0.7cm of TR, highlight_color, inner sep=0pt, font=\bfseries] (tr-comment) {$\etree_\rchild$};
% \draw[<-|, highlight_color] (TR) -- (tr-comment);
\end{tikzpicture}
\caption{Weights computed by $\onepass$ in~\cref{example:one-pass}.}
\label{fig:expr-tree-T-wght}
\end{figure}
@ -566,7 +567,7 @@ For the base case, let the depth $d$ of $\etree$ be $0$. We have that the root
For the inductive hypothesis, assume that for all $d \leq k$, for some $k \geq 0$, $\sampmon$ indeed returns a monomial.
For the inductive step, let us take a tree $\etree$ with $d = k + 1$. Note that each child has depth $d \leq k$, and by inductive hypothesis both of them return a valid monomial. Then the root can be either a $+$ or $\times$ node. For the case of a $+$ root node, line~\ref{alg:sample-plus-bsamp} of $\sampmon$ will choose one of the children of the root. Since by inductive hypothesis a monomial is returned from either child, and only one of these monomials is selected, we have for the case of a $+$ root node that a valid monomial is returned by $\sampmon$. When the root is a $\times$ node, lines~\ref{alg:sample-times-union} and~\ref{alg:sample-times-product} multiply the monomials returned by the two children of the root, and it is trivial to see that %by definition ~\ref{def:monomial}
the product of two monomials is also a monomial, which means that $\sampmon$ returns a valid monomial for the $\times$ root node, thus concluding the fact that $\sampmon$ indeed returns a monomial.
We will next prove by induction on the depth $d$ of $\etree$ that the $(\monom,\coef)$ returned by $\sampmon$ has a probability %`that is in accordance with the monomial sampled,
@ -608,7 +609,7 @@ It is easy to check that except for~\Cref{alg:sample-times-union}, all other lin
More specifically consider $\onepass$. The algorithm (as well as its analysis) basically uses the fact that one can compute the corresponding polynomial at all $1$s input with a simple recursive formula (\cref{eq:T-all-ones}), and that we can compute a probability distribution based on these weights (as in~\cref{eq:T-weights}). It can be verified that all the arguments go through if we replace $\etree_\lchild$ and $\etree_\rchild$ for expression tree $\etree$ with the two incoming nodes of the sink for the given lineage circuit. Another way to look at this is we could `unroll' the recursion in $\onepass$ and think of the algorithm as doing the evaluation at each node bottom up from leaves to the root in the expression tree. For lineage circuits, we start from the source nodes and do the computation in the topological order till we reach the sink(s).
The argument for $\sampmon$ is similar. We argued that $\onepass$ works as intended for lineage circuits because~\Cref{alg:one-pass} only recurses on children of the current node in the expression tree, so it generalizes to lineage circuits by recursing to the two children of the current node in the lineage circuit. Alternatively, as we have already used in the proof of~\Cref{lem:sample}, we can think of the sampling algorithm as sampling a sub-graph of the expression tree. For lineage circuits, we can think of $\sampmon$ as sampling the same sub-graph. Alternatively, one can implicitly expand the circuit lineage into a (larger but) equivalent expression tree. Since $\sampmon$ only explores one sub-graph during its run, we can think of its run on a lineage circuit as being done on the implicit equivalent expression tree\footnote{
Recall that $\sampmon$ scales only in the depth of the expression and its polynomial degree ($k$). There exist polynomials that can be encoded in size $\Omega(\log k)$, but we follow convention in assuming that the circuit size is asymptotically larger than $k$ and thus treat the degree (i.e., join width) as a constant.
}. Hence, all of the results on $\sampmon$ on expression trees carry over to lineage circuits.
@ -616,122 +617,130 @@ Thus, we have argued that~\Cref{lem:approx-alg} also holds if we use a lineage c
\subsection{Representing Polynomials with Lineage Circuits}\label{app:subsec-rep-poly-lin-circ}
\newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}
Each vertex $v \in V_{Q,\pxdb}$ in the arithmetic circuit for
\[\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}\]
encodes a polynomial, realized as
\[\getpoly{v} = \begin{cases}
\sum_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = +\\
\prod_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = \times\\
\ell(v) & \textbf{otherwise}
\end{cases}\]
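As a small worked instance (a sketch with a hypothetical circuit that is not taken from the paper), suppose $v_1, v_2$ are source vertices with labels $\ell(v_1) = X$ and $\ell(v_2) = Y$, $v_3$ is a $+$ vertex with incoming edges from $v_1$ and $v_2$, and $v_4$ is a $\times$ vertex with incoming edges from $v_3$ and $v_1$. Unfolding the definition bottom-up gives
\begin{align*}
\getpoly{v_1} &= X, & \getpoly{v_2} &= Y,\\
\getpoly{v_3} &= \getpoly{v_1} + \getpoly{v_2} = X + Y, & \getpoly{v_4} &= \getpoly{v_3}\cdot\getpoly{v_1} = (X + Y)\cdot X.
\end{align*}
Note that $v_1$ feeds two vertices, which a circuit permits but an expression tree does not.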
We define the circuit for a select-union-project-join query $Q$ recursively by cases as follows. In each case, let $\tuple{V_{Q_i,\pxdb}, E_{Q_i,\pxdb}, \phi_{Q_{i},\pxdb}, \ell_{Q_i,\pxdb}}$ denote the circuit for subquery $Q_i$.
\caseheading{Base Relation}
Let $Q$ be a base relation $R$. We define one node for each tuple. Formally, let $V_{Q,\pxdb} = \comprehension{v_t}{t\in R}$, let $\phi_{Q,\pxdb}(t) = v_t$, let $\ell_{Q,\pxdb}(v_t) = R(t)$, and let $E_{Q,\pxdb} = \emptyset$.
This circuit has $|R|$ vertices.
\caseheading{Selection}
Let $Q = \sigma_\theta \inparen{Q_1}$.
We re-use the circuit for $Q_1$. %, but define a new distinguished node $v_0$ with label $0$ and make it the sink node for all tuples that fail the selection predicate.
Formally, let $V_{Q,\pxdb} = V_{Q_1,\pxdb}$, let $\ell_{Q,\pxdb}(v_0) = 0$, and let $\ell_{Q,\pxdb}(v) = \ell_{Q_1,\pxdb}(v)$ for any $v \in V_{Q_1,\pxdb}$. Let $E_{Q,\pxdb} = E_{Q_1,\pxdb}$, and define
$$\phi_{Q,\pxdb}(t) =
\phi_{Q_{1}, \pxdb}(t) \text{ for } t \text{ s.t.}\; \theta(t).$$
Dead sinks are iteratively removed, and so
%\AH{While not explicit, I assume a reviewer would know that the notation above discards tuples/vertices not satisfying the selection predicate.}
%v_0 & \textbf{otherwise}
%\end{cases}$$
this circuit has at most $|V_{Q_1,\pxdb}|$ vertices.
\caseheading{Projection}
Let $Q = \pi_{\vct A} {Q_1}$.
We extend the circuit for ${Q_1}$ with a new set of sum vertices (i.e., vertices with label $+$) for each tuple in $Q$, and connect them to the corresponding sink nodes of the circuit for ${Q_1}$.
Naively, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup \comprehension{v_t}{t \in \pi_{\vct A} {Q_1}}$, let $\phi_{Q,\pxdb}(t) = v_t$, and let $\ell_{Q,\pxdb}(v_t) = +$. Finally let
$$E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup \comprehension{(\phi_{Q_{1}, \pxdb}(t'), v_t)}{t = \pi_{\vct A} t', t' \in {Q_1}, t \in \pi_{\vct A} {Q_1}}$$
This formulation will produce vertices with an in-degree greater than two, a problem that we correct by replacing every vertex with an in-degree over two by an equivalent fan-in tree. The resulting structure has at most $|{Q_1}|-1$ new vertices.
% \AH{Is the rightmost operator \emph{supposed} to be a $-$? In the beginning we add $|\pi_{\vct A}{Q_1}|$ vertices.}
The corrected circuit thus has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
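To illustrate the fan-in correction (a sketch with hypothetical inputs $a, b, c, d$), a single sum vertex with in-degree $4$ is replaced by three sum vertices of in-degree $2$:
\[
a + b + c + d \;=\; (a + b) + (c + d).
\]
In general, a vertex with in-degree $m \geq 2$ becomes a binary tree with $m - 1$ internal vertices, one of which takes the place of the original vertex.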
\caseheading{Union}
Let $Q = {Q_1} \cup {Q_2}$.
We merge graphs and produce a sum vertex for all tuples in both sides of the union.
Formally, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup V_{Q_2,\pxdb} \cup \comprehension{v_t}{t \in {Q_1} \cap {Q_2}}$, let $\ell_{Q,\pxdb}(v_t) = +$, and let
\[E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup E_{Q_2,\pxdb} \cup \comprehension{(\phi_{Q_{1}, \pxdb}(t), v_t), (\phi_{Q_{2}, \pxdb}(t), v_t)}{t \in {Q_1} \cap {Q_2}}\]
\[
\phi_{Q,\pxdb}(t) = \begin{cases}
v_t & \textbf{if } t \in {Q_1} \cap {Q_2}\\
\phi_{Q_{1}, \pxdb}(t) & \textbf{if } t \not \in {Q_2}\\
\phi_{Q_{2}, \pxdb}(t) & \textbf{if } t \not \in {Q_1}\\
\end{cases}\]
This circuit has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
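As a hypothetical illustration (the tuples $t_1, t_2, t_3$ are assumed for this example only), let ${Q_1} = \{t_1, t_2\}$ and ${Q_2} = \{t_2, t_3\}$, so that ${Q_1} \cap {Q_2} = \{t_2\}$. The construction then adds a single sum vertex $v_{t_2}$ with the incoming edges $(\phi_{Q_{1}, \pxdb}(t_2), v_{t_2})$ and $(\phi_{Q_{2}, \pxdb}(t_2), v_{t_2})$, and the sink assignment becomes
\[
\phi_{Q,\pxdb}(t_1) = \phi_{Q_{1}, \pxdb}(t_1), \qquad
\phi_{Q,\pxdb}(t_2) = v_{t_2}, \qquad
\phi_{Q,\pxdb}(t_3) = \phi_{Q_{2}, \pxdb}(t_3).
\]
Assuming the two input circuits share no vertices, the resulting circuit has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+1$ vertices, in agreement with the count above.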
\caseheading{$k$-ary Join}
Let $Q = {Q_1} \bowtie \ldots \bowtie {Q_k}$.
We merge the graphs and produce a multiplication vertex for every tuple resulting from the join.
Naively, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup \ldots \cup V_{Q_k,\pxdb} \cup \comprehension{v_t}{t \in {Q_1} \bowtie \ldots \bowtie {Q_k}}$, let
{\small
\begin{multline*}
E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup \ldots \cup E_{Q_k,\pxdb} \cup
\left\{\;
(\phi_{Q_{1}, \pxdb}(\pi_{\sch({Q_1})}t), v_t), \right.\\
\ldots, (\phi_{Q_k,\pxdb}(\pi_{\sch({Q_k})}t), v_t)
\;\left|\;t \in {Q_1} \bowtie \ldots \bowtie {Q_k}\;\right\}
\end{multline*}
}
Let $\ell_{Q,\pxdb}(v_t) = \times$, and let $\phi_{Q,\pxdb}(t) = v_t$.
As in projection, newly created vertices will have an in-degree of $k$, and a fan-in tree is required.
There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
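As a hypothetical illustration with $k = 3$ (the names $t_1, t_2, t_3$ are assumed for this example only), consider a join result $t \in {Q_1} \bowtie {Q_2} \bowtie {Q_3}$ and write $t_i = \pi_{\sch({Q_i})}t$. The naive vertex $v_t$ has an in-degree of three, whereas the fan-in tree arranges the same inputs as
\[
\left(\phi_{Q_{1}, \pxdb}(t_1) \times \phi_{Q_{2}, \pxdb}(t_2)\right) \times \phi_{Q_{3}, \pxdb}(t_3),
\]
which uses $k-1 = 2$ multiplication vertices, each with an in-degree of two. This is where the factor $(k-1)$ in the vertex count above comes from.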
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Proof for~\Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
The proof is by induction on the structure of $Q$. In the base case, $Q = R$ is a base relation, and the claim holds trivially since $|V_{R,\pxdb}| = |R|$.
For the inductive step, we assume that we have circuits for subplans $Q_1, \ldots, Q_n$ such that $|V_{Q_i,\pxdb}| \leq (k_i-1)\qruntime{Q_i,\pxdb}$ where $k_i$ is the degree of $Q_i$.
\caseheading{Selection}
Assume that $Q = \sigma_\theta(Q_1)$.
The circuit for $Q$ has $|V_{Q,\pxdb}| = |V_{Q_1,\pxdb}|$ vertices. By the inductive assumption, and since $\qruntime{Q,\pxdb} = \qruntime{Q_1,\pxdb}$ by definition, we have $|V_{Q,\pxdb}| \leq (k-1) \qruntime{Q,\pxdb}$.
% \AH{Technically, $\kElem$ is the degree of $\poly_1$, but I guess this is a moot point since one can argue that $\kElem$ is also the degree of $\poly$.}
% OK: Correct
\caseheading{Projection}
Assume that $Q = \pi_{\vct A}(Q_1)$.
The circuit for $Q$ has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
% \AH{The combination of terms above doesn't follow the details for projection above.}
\begin{align*}
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}| + |Q_1|\\
%\intertext{By \Cref{prop:queries-need-to-output-tuples} $\qruntime{Q_1,\pxdb} \geq |Q_1|$}
%& \leq |V_{Q_1,\pxdb}| + 2 \qruntime{Q_1,\pxdb}\\
\intertext{(From the inductive assumption)}
& \leq (k-1)\qruntime{Q_1,\pxdb} + \abs{Q_1}\\
\intertext{(By definition of $\qruntime{Q,\pxdb}$)}
& \le (k-1)\qruntime{Q,\pxdb}.
\end{align*}
\caseheading{Union}
Assume that $Q = Q_1 \cup Q_2$.
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
\begin{align*}
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1}|+|{Q_2}|\\
%\intertext{By \Cref{prop:queries-need-to-output-tuples} $\qruntime{Q_1,\pxdb} \geq |Q_1|$}
%& \leq |V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+\qruntime{Q_1,\pxdb}+\qruntime{Q_2,\pxdb}|\\
\intertext{(From the inductive assumption)}
& \leq (k-1)(\qruntime{Q_1,\pxdb} + \qruntime{Q_2,\pxdb}) + (\abs{Q_1} + \abs{Q_2})\\
\intertext{(By definition of $\qruntime{Q,\pxdb}$)}
& \leq (k-1)(\qruntime{Q,\pxdb}).
\end{align*}
\caseheading{$k$-ary Join}
Assume that $Q = Q_1 \bowtie \ldots \bowtie Q_k$.
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
\begin{align*}
|V_{Q,\pxdb}| & = |V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|\\
\intertext{From the inductive assumption and noting $\forall i: k_i \leq k-1$}
& \leq (k-1)\qruntime{Q_1,\pxdb}+\ldots+(k-1)\qruntime{Q_k,\pxdb}+\\
&\;\;\; (k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|\\
& \leq (k-1)(\qruntime{Q_1,\pxdb}+\ldots+\qruntime{Q_k,\pxdb}+\\
&\;\;\;|{Q_1} \bowtie \ldots \bowtie {Q_k}|)\\
\intertext{(By definition of $\qruntime{Q,\pxdb}$)}
& = (k-1)\qruntime{Q,\pxdb}.
\end{align*}
The bound therefore holds in every case of the inductive step, which completes the proof.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: