Finished subsection 3.1 and started subsection 3.2 of Intro.

Aaron Huber 2022-01-27 10:58:33 -05:00
parent f5567e5e94
commit 2dfd9c7452
3 changed files with 56 additions and 109 deletions


@ -143,21 +143,20 @@ Finally, note that there are exactly three cases where the expectation of a mono
\subsection{Proof for Lemma~\ref{lem:exp-poly-rpoly}}\label{subsec:proof-exp-poly-rpoly}
Let $\poly$ be a polynomial of $\abs{\tupset}$ variables with highest degree $= B$, defined as follows: %, in which every possible monomial permutation appears,
Let $\poly$ be a polynomial of $\numvar$ variables with highest degree $= B$, defined as follows: %, in which every possible monomial permutation appears,
\[\poly(X_1,\ldots, X_\numvar) = \sum_{\vct{d} \in \{0,\ldots, B\}^\numvar}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar X_i^{d_i}.\]
%Let the boolean function $\isInd{\cdot}$ take $\vct{d}$ as input and return true if there does not exist any dependent variables in $\vct{d}$, i.e., $\not\exists ~\block, i\neq j\suchthat d_{\block, i}, d_{\block, j} \geq 1$.\footnote{This \abbrBIDB notation is used and discussed in \cref{subsec:tidbs-and-bidbs}}.
In expectation we have
Let the Boolean function $\isInd{\cdot}$ take $\vct{d}$ as input and return true if there do not exist any dependent variables in $\vct{d}$, i.e., $\not\exists ~\block, i\neq j\suchthat d_{\block, i}, d_{\block, j} \geq 1$.\footnote{This \abbrBIDB notation is used and discussed in \cref{subsec:tidbs-and-bidbs}.}
Then in expectation we have
\expct_{\vct{\randWorld}\sim\bpd}\pbox{\poly(\vct{\randWorld})} &= \expct_{\vct{\randWorld}\sim\bpd}\pbox{\sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \randWorld_i^{d_i}} \label{p1-s1a}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar}}c_{\vct{d}}\cdot \expct_{\vct{\randWorld}\sim\bpd}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \randWorld_i^{d_i}}\label{p1-s1c}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{\randWorld}\sim\bpd}\pbox{\randWorld_i^{d_i}}\label{p1-s2}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{\randWorld}\sim\bpd}\pbox{\randWorld_i}\label{p1-s3}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
\expct_{\vct{\randWorld}}\pbox{\poly(\vct{\randWorld})} &= \expct_{\vct{\randWorld}}\pbox{\sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \randWorld_i^{d_i} + \sum_{\substack{\vct{d} \in \{0,\ldots, B\}^\numvar\\\wedge ~\neg\isInd{\vct{d}}}} c_{\vct{d}}\cdot\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar\randWorld_i^{d_i}}\label{p1-s1a}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \expct_{\vct{\randWorld}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \randWorld_i^{d_i}} + \sum_{\substack{\vct{d} \in \{0,\ldots, B\}^\numvar\\\wedge ~\neg\isInd{\vct{d}}}} c_{\vct{d}}\cdot\expct_{\vct{\randWorld}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar\randWorld_i^{d_i}}\label{p1-s1b}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \expct_{\vct{\randWorld}}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \randWorld_i^{d_i}}\label{p1-s1c}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{\randWorld}}\pbox{\randWorld_i^{d_i}}\label{p1-s2}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \expct_{\vct{\randWorld}}\pbox{\randWorld_i}\label{p1-s3}\\
&= \sum_{\substack{\vct{d} \in \{0,\ldots,B\}^\numvar\\\wedge~\isInd{\vct{d}}}}c_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
&= \rpoly(\prob_1,\ldots, \prob_\numvar).\label{p1-s5}
\Cref{p1-s1a} is the result of substituting in the definition of $\poly$ given above. Then we arrive at \cref{p1-s1c} by linearity of expectation.
%Next, is the result of the independence constraint of \abbrBIDB\xplural, specifically that any monomial composed of dependent variables, i.e., variables from the same block $\block$, has a probability of $0$.
\Cref{p1-s2} follows from the fact that all variables in each surviving monomial are independent, which allows the expectation to be pushed through the product. In \cref{p1-s3}, since $\randWorld_i \in \{0, 1\}$, it is the case that $\randWorld_i^e = \randWorld_i$ for any exponent $e \geq 1$. Next, in \cref{p1-s4} the expectation of a $1$-\abbrBIDB tuple is exactly its marginal probability, since $\randWorld_i \in \{0, 1\}$ implies $\expct\pbox{\randWorld_i} = \probOf\pbox{\randWorld_i = 1} = \prob_i$.
\Cref{p1-s1a} is the result of substituting in the definition of $\poly$ given above. Then we arrive at \cref{p1-s1b} by linearity of expectation. Next, \cref{p1-s1c} follows from the disjointness constraint of \abbrBIDB\xplural: any monomial containing dependent variables, i.e., distinct variables from the same block $\block$, evaluates to $0$ in every possible world and therefore has expectation $0$. \Cref{p1-s2} follows from the fact that all variables in each remaining monomial are independent, which allows the expectation to be pushed through the product. In \cref{p1-s3}, since $\randWorld_i \in \{0, 1\}$, it is the case that $\randWorld_i^e = \randWorld_i$ for any exponent $e \geq 1$. Next, in \cref{p1-s4} the expectation of a tuple variable is exactly its probability.
Finally, it can be verified that \Cref{p1-s5} follows since \cref{p1-s4} satisfies the construction of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ in \Cref{def:reduced-bi-poly}.
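As a small sanity check of the derivation above (an illustrative example that is not part of the original proof), take $\numvar = 2$ and $\poly(X_1, X_2) = (X_1 + X_2)^2 = X_1^2 + 2X_1X_2 + X_2^2$, and assume that $X_1$ and $X_2$ belong to different blocks, so that $\randWorld_1$ and $\randWorld_2$ are independent. Then
\[\expct_{\vct{\randWorld}}\pbox{\poly(\vct{\randWorld})} = \expct_{\vct{\randWorld}}\pbox{\randWorld_1^2} + 2\cdot\expct_{\vct{\randWorld}}\pbox{\randWorld_1\randWorld_2} + \expct_{\vct{\randWorld}}\pbox{\randWorld_2^2} = \prob_1 + 2\prob_1\prob_2 + \prob_2,\]
which matches the right-hand side of \cref{p1-s4}: every exponent is dropped and each $X_i$ is replaced by $\prob_i$. If we instead assume that $X_1$ and $X_2$ lie in the same block, then $\randWorld_1\randWorld_2 = 0$ in every possible world, the cross term vanishes, and the expectation is $\prob_1 + \prob_2$.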


@ -121,7 +121,7 @@ Unfortunately, we prove that this is not the case. To analyze this question we
Define $\gentupset$ to be the set of tuples appearing across all the possible worlds of a $\abbrCTIDB$, formally $\gentupset = \inset{\tup_i ~|~ \forall \worldvec \in \worlds,~\forall i \in \abs{\tupset}:~\worldvec\pbox{i} > 0}$. When a specific $\pdb = \inparen{\worlds, \bpd}$ is being referred to, we will use $\tupset$ to denote the set of tuples.
Let $\qruntime{\query, \dbbase}$ be the optimal runtime (with some caveats; discussed in~\cref{sec:gen}) of query $\query$ on a comparable deterministic database $\dbbase$ defined next.
Let $\qruntime{\query, \gentupset}$ be the optimal runtime (with some caveats; discussed in~\cref{sec:gen}) of query $\query$ on a comparable deterministic database $\gentupset$ defined next.
%We make this runtime concrete later on.
%We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
@ -135,22 +135,22 @@ Let $\qruntime{\query, \dbbase}$ be the optimal runtime (with some caveats; disc
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\pd$s & Hardness Assumption\\
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\Omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
$\Omega\inparen{\inparen{\qruntime{\query, \gentupset}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same (family of) $\dbbase$ and those with `Multiple' in the second column need the algorithm to be able to handle multiple $\pd$ (for a given $\dbbase$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same (family of) $\gentupset$ and those with `Multiple' in the second column need the algorithm to be able to handle multiple $\pd$ (for a given $\gentupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\mypar{Our lower bound results} In Table~\ref{tab:lbs} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the query $\query$ (i.e., join width) over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
\mypar{Our lower bound results} In Table~\ref{tab:lbs} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \gentupset}}^k}$, where $k$ is the largest degree of the query $\query$ (i.e., join width) over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \dbbase}$ by just $\abs{\dbbase}$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \gentupset}$ by just $\abs{\gentupset}$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that solves~\cref{prob:expect-mult} in $O_\epsilon\inparen{\qruntime{\query, \dbbase}}$ time.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that solves~\cref{prob:expect-mult} in $O_\epsilon\inparen{\qruntime{\query, \tupset}}$ time.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
@ -200,7 +200,6 @@ is equivalent to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:exp
\subsection{Our Machinery}
\mypar{Lower Bound Proof Techniques}
All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrCTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistic database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB. For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
@ -210,7 +209,6 @@ Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in
Consider the $\boldsymbol{Route}$ relation of~\cref{fig:two-step} and query $\query = \project_{\text{City}_1}\inparen{\boldsymbol{Route}}$. The output relation $\query$ is $\inset{\intup{Chicago, X}, \intup{Chicago, Y}}$ and can be represented as a \abbrCTIDB $\query' = \inset{\intup{Chicago, X', 2}}$, where the following probabilities are true: $\probOf\pbox{X' = 0} = \probOf\pbox{\neg X \wedge \neg Y}$, $\probOf\pbox{X' = 1} = \probOf\pbox{\inparen{X \vee Y}\wedge\inparen{\neg X \vee \neg Y}}$, and $\probOf\pbox{X' = 2} = \probOf\pbox{X\wedge Y}$. $\query'$ can then be reduced to a $1$-\abbrBIDB by creating a block of the following disjoint tuples: $\query'' = \inset{\intup{\text{Chicago}, X'_0}, \intup{\text{Chicago}, X'_1}, \intup{\text{Chicago}, X'_2}}$ such that $\probOf\pbox{X'_i = 1} = \probOf\pbox{X' = i}$.
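To make the reduction concrete, suppose (purely for illustration) that $X$ and $Y$ are independent events with probabilities $p_X$ and $p_Y$; the symbols $p_X$ and $p_Y$ are introduced only for this sketch. Then
\[\probOf\pbox{X' = 0} = (1 - p_X)(1 - p_Y),\quad \probOf\pbox{X' = 1} = p_X(1 - p_Y) + (1 - p_X)p_Y,\quad \probOf\pbox{X' = 2} = p_Xp_Y.\]
These three probabilities sum to $1$, as required for the disjoint tuples of a single block, and the expected multiplicity computed over the block is
\[\sum_{i = 0}^{2} i\cdot\probOf\pbox{X'_i = 1} = p_X(1 - p_Y) + (1 - p_X)p_Y + 2p_Xp_Y = p_X + p_Y,\]
which agrees with summing the marginal probabilities of the two original tuples.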
\AH{Left to do for this subsubsection: read through the rest and update def and lemma; start and complete second subsubsection \textbf{Upper Bound Proof and Algorithm Techniques}.}
Next, we motivate this reduced polynomial.
Consider the query $\query$ defined as follows over the bag relations of \Cref{fig:two-step}:
@ -289,21 +287,47 @@ we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdo
To get a $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
\mypar{Upper Bound Techniques}
Our negative results (\cref{tab:lbs}) indicate that \abbrCTIDB{}s cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrCTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
Denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:
Our next question is whether there exists a $\inparen{1\pm\epsilon}$-approximation algorithm whose runtime is linear in that of the deterministic query. If so, we have shown that approximate query processing over bag \abbrPDB\xplural is comparable in runtime to deterministic query processing.
Here is where we introduce the specifics for producing our results and algorithm.
\item 2-step model
\item lower/upper bounds \emph{in terms of polynomials}.
\item intensional semantics/2-step model
\item circuit representation
\item $\rpoly\inparen{\cdot}$
\item $1$-\abbrBIDB reduction
\begin{Problem}[Bag-\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\vct{\db}\sim\pd}\pbox{\query\inparen{\vct{\db}}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O_\epsilon(\qruntime{Q, \tupset})$?
We show in \Cref{sec:circuit-depth} an $O(\qruntime{Q, \tupset})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \tupset}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
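To illustrate why the choice of representation matters (a standard example over hypothetical variables $X_1, Y_1, \ldots, X_k, Y_k$, not one taken from \Cref{fig:two-step}), consider a product of $k$ binary sums:
\[\prod_{i = 1}^{k}\inparen{X_i + Y_i} = \sum_{S \subseteq \{1,\ldots,k\}}\;\prod_{i \in S}X_i\prod_{i \notin S}Y_i.\]
The factorized form on the left uses only $2k$ variable occurrences, while the \abbrSMB form on the right consists of $2^k$ monomials, each of degree $k$.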
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
as the representation system of $\poly(\vct{X})$.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le O(\qruntime{\query, \tupset})$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{Q, \tupset}$ (i.e., $|\circuit^*| \le O(\qruntime{Q, \tupset})$).
Thus, \AHchange{the question of approximation} %\Cref{prob:big-o-joint-steps}
can be reframed as:
\begin{Problem}[\Cref{prob:big-o-joint-steps} reframed]\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\vct{\db}\sim\pd}\pbox{\query\inparen{\vct{\db}}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
@ -421,82 +445,6 @@ We stress that this question is very well motivated, even for \abbrTIDBs: An ans
\mypar{Approximating the expected multiplicities}
%\AR{Have done my pass till here}
Our negative results indicate that \abbrBPDB{}s cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrTIDB. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
\AR{In \Cref{fig:two-step}, perhaps in caption give f/w pointer to definition of $Q$?}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\dbbase$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
Denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:\AH{What if there are more efficient representations than circuits?}
Our next question is whether there exists a $\inparen{1\pm\epsilon}$-approximation algorithm whose runtime is linear in that of the deterministic query. If so, we have shown that approximate query processing over bag \abbrPDB\xplural is comparable in runtime to deterministic query processing.
Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O_\epsilon(\qruntime{Q, \dbbase})$?
%Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplicities).
We show in \Cref{sec:circuit-depth} %{sec:gen}\AR{Refs needs to be updated}
%\OK{confirm this ref}
%Atri: fixed the ref
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
% , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}.
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$,
%\BG{should be $|\idb |$?},
%Atri: No, this is fine
and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
as the representation system of $\poly(\vct{X})$.
% When $\poly(\vct{X})$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it is a sum of products of variables (a variable can occur more than once), where each product of variables is unique.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $O(|\poly_\tup(\vct{X})|)$ and thus also $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$.
% \AH{Is this obvious enough for the typical reviewer to realize?}
% Recall that $\prob_i$ denotes the probability of tuple $\tup_i$ (i.e. $\probOf\pbox{W_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.
% % Replaced the stuff below with something more auccint
% %For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in
% %$\abs{\poly_\tup}$
% %the size of the arithmetic circuit
% %, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}
% In this case, we have for any output tuple $\tup$, $\expct\pbox{\poly(\vct{W})}=\Phi(1,\dots,1)$.
% Thus, we have another case where $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). These observations introduce our first formalization of~\Cref{prob:informal}:
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\dbbase,\circuit^*)\le O(\qruntime{\query, \dbbase})$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
%\BG{This sentence didn't parse for me. What do we mean by representing a polynomial by a size?}
%Atri: fixed
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{Q, \dbbase}$ (i.e., $|\circuit^*| \le O(\qruntime{Q, \dbbase})$).
Thus, \AHchange{the question of approximation} %\Cref{prob:big-o-joint-steps}
can be reframed as:
%Atri: Replaced the text below by the above. I know I had talked about $|\circuit|^k$ but I think the stuff below breaks the flow a bit
%Re-stating our earlier observation, given a circuit \circuit, if \circuit is in \abbrSMB (i.e. every sink to source path has a prefix of addition nodes and the rest of the internal nodes are multiplication nodes), then we have that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is indeed $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. We note that \abbrSMB representations are produced by queries with a projection operation on top of a join operation.
% the form $\project, \project\inparen{\join},$ etc.
% Suppose, on the contrary, that \circuit is not in \abbrSMB and rather in some factorized form. Then to naively compute \abbrStepTwo, one needs to convert \circuit into \circuit' such that \circuit' is in \abbrSMB, and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{|\circuit|^k}$ time for the case that $k$ is the degree of the polynomial $\Phi_\tup(\vct{X})$. Since $|\circuit'|$ lies between $\bigO{|\circuit|}$ and $\bigO{|\circuit|^k}$, it behooves us to determine which of these extremes is true for the general \circuit. This leads us to the main problem statement of our paper:
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
%\OK{This doesn't parse. What is $\bigO{\abbrStepOne}$? Should this be $\bigO{\poly}$?}
%Contributions, Overview, Paper Organization


@ -126,7 +126,7 @@
\node[below=0.2cm of rrect]{{\LARGE $\expct\pbox{\poly(\vct{X})}$}};
\caption{Intensional Query ($\query$) Evaluation Model.}
\caption{Intensional Query Evaluation Model ($\query = \project_{\text{City}}\inparen{Route\join_{\text{City}_1 = \text{City}}OnTime}$).}