Reclaimed a few lines in S.1.

Aaron Huber 2022-06-07 11:45:50 -04:00
parent c82fbbdcca
commit d8b870b595
1 changed files with 14 additions and 13 deletions

@ -4,7 +4,7 @@
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ is defined over a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
$\pdb = \inparen{\worlds, \bpd}$ is defined over a \dbbaseName $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$.
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a bag of tuples (\Cref{prop:expection-of-polynom} in \Cref{subsec:expectation-of-polynom-proof}).
@ -113,7 +113,7 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
\mypar{Our lower bound results}
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a \dbbaseName $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both deterministic and probabilistic query processing access to $\query'$}.
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent and more efficient query $\query'$, we allow both deterministic and probabilistic query processing access to $\query'$}.
Unfortunately, the answer to the above question is no--
\Cref{tab:lbs} shows our results.
@ -171,7 +171,7 @@ $$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABU}^2} = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_U$ be the random variable corresponding to a lineage variable $U$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $\randWorld_U\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_U\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABU}^2}\inparen{\prob_A, \prob_U, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
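To see concretely why the argument stalls, the second moment of $\randWorld_U$ can be written out directly from its distribution over $\inset{0, 1, 2}$ (a worked step added for illustration, using only the definitions above):
\[
\expct\pbox{\randWorld_U^2} = 1^2\cdot\probOf\pbox{\randWorld_U = 1} + 2^2\cdot\probOf\pbox{\randWorld_U = 2} = \probOf\pbox{\randWorld_U = 1} + 4\,\probOf\pbox{\randWorld_U = 2},
\]
whereas $\expct\pbox{\randWorld_U} = \probOf\pbox{\randWorld_U = 1} + 2\,\probOf\pbox{\randWorld_U = 2}$. The two agree only when $\probOf\pbox{\randWorld_U = 2} = 0$, so the simplification $W^2 = W$ cannot be applied to $U$.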
The simple insight to get around this issue is to note that the random variables $\randWorld_U$ and $\randWorld_{U_1}+2\randWorld_{U_2}$ have exactly the same distribution, where $\randWorld_{U_1},\randWorld_{U_2}\in\inset{0,1}$ and $\probOf\pbox{\randWorld_{U_j} = 1} = \probOf\pbox{\randWorld_{U} = j}$. Thus, the idea is to replace the variable $U$ by $U_1+2U_2$ (where $U_j$ corresponds to the event that $U$ has multiplicity $j$) to obtain the following polynomial:
The simple insight to get around this issue is to note that the random variables $\randWorld_U$ and $\randWorld_{U_1}+2\randWorld_{U_2}$ have exactly the same distribution, where $\randWorld_{U_1},\randWorld_{U_2}\in\inset{0,1}$ and $\probOf\pbox{\randWorld_{U_j} = 1} = \probOf\pbox{\randWorld_{U} = j}$. Thus, the idea is to replace the variable $U$ by $U_1+2U_2$ (where $U_j$ corresponds to the event that $U$ has multiplicity $j$) yielding% to obtain the following polynomial:
%Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}\inparen{\vct{X_R}}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$ . Given $X_\tup \in\vars{\poly}$ and integer valuation $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}jX_{\tup, j}$ where the variables $\inparen{X_{\tup, j}}_{j\in\pbox{\bound}}$ are disjoint with integer assignments $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$ and corresponding reformulated world $\worldvec_{\vct{R}}\in\inset{0, 1}^{\tupset\bound}$, we set $\worldvec_{\vct{R}_{\tup, j}} = 1$ for $\worldvec_\tup = j$, while $\worldvec_{\vct{R}_{\tup, j'}} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any integer valuation $X_\tup\in\pbox{\bound}$, $X_{\tup, j}\in\inset{0, 1}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_{t, j}$.
@ -207,16 +207,16 @@ removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for $\tup\i
%&= ABX_1 + AB\inparen{2}^2X_2+ BYE + BZC + 2AX_1BYE+ 2A\inparen{2}^2X_2BYE\\
%&\qquad + 2AX_1BZC + 2A\inparen{2}^2X_2BZC + 2BYEZC.
As we have essentially argued earlier, for our specific example the expectation that we want is $\rpoly_1^2(\probOf\inparen{A=1}, \ldots,
As we have essentially argued earlier, the expectation for our specific example is $\rpoly_1^2(\probOf\inparen{A=1}, \ldots,
%$\allowbreak$\probOf\inparen{B=1}, \probOf\inparen{C=1}$,\allowbreak $\probOf\inparen{E=1},$\allowbreak $\probOf\inparen{U_1=1}, \probOf\inparen{U_2=1}, \probOf\inparen{Y=1},
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}):
This equivalence generalizes to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}):
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\prob_{\tup,j}}_{\tup\in\tupset,j\in\pbox{\bound}}.$
$, where $\probAllTup = \inparen{\prob_{\tup,j}}_{\tup\in\tupset,j\in\pbox{\bound}}.$
@ -254,12 +254,13 @@ However, systems can directly emit compact, factorized representations of $\poly
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable/numeric source gates and multiplication/addition internal/sink gates.
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a circuit $\circuit$ with $\numvar$ sinks, one per output tuple).% representing the tuple's lineage).
Given that a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can focus on the complexity of \abbrStepTwo.
As we also show in \Cref{sec:circuit-runtime}, the size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
Since a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ and
the size of $\circuit^*$ is bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$) (see~\Cref{sec:circuit-runtime}), we can focus on the complexity of \abbrStepTwo.
%Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
Given such a $\circuit^*$, to solve \Cref{prob:big-o-joint-steps}, it is \emph{sufficient} to solve: % the following problem:
%Given such a $\circuit^*$,
To solve \Cref{prob:big-o-joint-steps}, it is \emph{sufficient} to solve: % the following problem:
Given one circuit $\circuit$ that encodes $\Phi\inparen{\vct{X}}$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
@ -275,13 +276,13 @@ $\refpoly{1, }^{\inparen{ABU}^2}\inparen{\vct{X}} = A^2\inparen{U_1^2 + 4U_1U_2
%&\qquad+ 2AX_2B^2YE + 2AX_1B^2ZC + 2AX_2B^2ZC + 2B^2YEZC\\
Recall that
%Recall that
$\rpoly_1^{\inparen{ABU}^2}\inparen{\vct{X}} = AU_1B+4AU_2B$,
$\rpoly_1^{\inparen{ABU}^2}\inparen{\vct{X}} = AU_1B+4AU_2B$
which implies:
\[ \refpoly{1, }^{\inparen{ABU}^2}\inparen{\probAllTup} -4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2=\prob_A^2\prob_{U_1}^2\prob_B^2 + 4\prob_A^2\prob_{U_2}^2\prob_B^2.\]
%Substituting $\vct{\prob}$ for $\vct{X}$,