Aaron Huber 2022-06-07 22:48:09 -04:00
commit c18cd5500a
3 changed files with 44 additions and 38 deletions

View File

@ -1,12 +1,12 @@
%root: main.tex
%!TEX root=./main.tex
In this work, we study the problem of computing a query result tuple's expected multiplicity for probabilistic databases under bag semantics (where each tuple is associated with a multiplicity) exactly and approximately.
We study the problem of computing a query result tuple's expected multiplicity for probabilistic databases under bag semantics (where each tuple is associated with a multiplicity) exactly and approximately.
Specifically, we are interested in the fine-grained complexity of this problem for \abbrCTIDB\xplural, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bound by a constant $\bound$.
% We consider bag-\abbrTIDB\xplural where we have a bound $\bound$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as \abbrCTIDB\xplural).
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the output of deterministic query evaluation algorithms introduces super-linear overhead (under certain parameterized complexity hardness conjectures).
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the output of deterministic query evaluation algorithms introduces super-linear overhead (under certain complexity hardness conjectures).
% We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
% Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
Nonetheless, we develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra ($\raPlus$) query over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
Next, we develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra ($\raPlus$) query over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases. % (\abbrBIDB\xplural).
% We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($\raPlus$) over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
% We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any $\raPlus$ query.

View File

@ -2,17 +2,18 @@
%root: main.tex
\section{Introduction}\label{sec:intro}
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
We explore the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ is defined over a \dbbaseName $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
$\pdb = \inparen{\worlds, \bpd}$ is defined over a \dbbaseName (i.e. a `base' set of tuples) $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$.
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a bag of tuples (\Cref{prop:expection-of-polynom} in \Cref{subsec:expectation-of-polynom-proof}).
Given that tuple multiplicities are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a (disjoint) probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,j}$ for $j = \worldvec_{\tup}$.
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a bag of tuples (\Cref{prop:expection-of-polynom}).
%in \Cref{subsec:expectation-of-polynom-proof}).
Given that tuple multiplicities are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a (disjoint) probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,j(t)}$ for $j(t) = \worldvec_{\tup}$.
%
% Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
% In this work, since we are generally considering bag query input, we will only be considering bag query semantics.
Note that in this work, we consider queries with bag semantics over such bag probabilistic databases.
In this work, we consider queries with bag semantics over such bag probabilistic databases.
We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of a result tuple $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
%
We can formally state our problem of computing the expected multiplicity: % of a result tuple as:
@ -66,19 +67,18 @@ As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524,DBLP:journal
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general been shown % by Dalvi and Suicu
to be \sharpphard\cite{10.1145/1265530.1265571}.
In an independent work, Grohe et. al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural with unbounded multiplicities, which requires %them to explicitly address the issue of
a succinct representation of probability distributions over infinitely many multiplicities.
They demonstrate a dichotomy for
a succinct representation of probability distributions over infinitely many multiplicities and proved a dichotomy for
%the problem of
computing the probability that an output tuple's multiplicity is bounded by $s$.
computing the probability of an output tuple's multiplicity being bounded by given $s$.
% investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem.
% While the authors observe that computing the expectation of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered.
% Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
In contrast to~\cite{https://doi.org/10.48550/arxiv.2201.11524}, we consider \abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$.
\mypar{Our Setup} In contrast to~\cite{https://doi.org/10.48550/arxiv.2201.11524}, we consider \abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$.
For this setting, % (\abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$), however,
there exists a trivial \ptime algorithm for computing the expectation of a result tuple's multiplicity~(\Cref{prob:expect-mult}) for any $\raPlus$ query due to linearity of expectation (see~\Cref{sec:intro-poly-equiv}).
Since we can solve~\Cref{prob:expect-mult} in \ptime, the %interesting
question that we explore is %the hardness of
Since~\Cref{prob:expect-mult} is in \ptime, the %interesting
we explore the question of %the hardness of
computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.\footnote{While %the authors of
\cite{https://doi.org/10.48550/arxiv.2201.11524} also observe that computing the expectation of an output tuple multiplicity is in \ptime, they do not investigate the fine-grained complexity of this problem.}
@ -93,16 +93,16 @@ Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in ti
\centering
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
\textbf{Lower bound on $\timeOf{}^*(\qhard,\pdb)$} & \textbf{Num.} $\bpd$s
\textbf{Lower bound on $\timeOf{}^*(\qhard^k,\pdb)$} & \textbf{Num.} $\bpd$s
& \textbf{Hardness Assumption}\\
\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Exponential Time Hypothesis\\%\Cref{conj:known-algo-kmatch}\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard^k}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Exponential Time Hypothesis (ETH)\\%\Cref{conj:known-algo-kmatch}\\
\hline
\end{tabular}
\savecaptionspace{
\caption{Our lower bounds for $\qhard$ parameterized by $k$ over $1$-TIDB $\pdb$. % = \inset{\worlds, \bpd}$.
\caption{Our lower bounds for $\qhard^k$ parameterized by $k$ over $1$-TIDB $\pdb$. % = \inset{\worlds, \bpd}$.
Those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$. See~\Cref{sec:hard} for further details.}%, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
}{0cm}{-0.73cm}
@ -112,24 +112,27 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Our lower bound results}
%
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a \dbbaseName $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over any deterministic database
%\dbbaseName
that maps each tuple in $\gentupset$ to a multiplicity in $[0,\bound]$.
%where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent and more efficient query $\query'$, we allow both deterministic and probabilistic query processing access to $\query'$}.
Unfortunately the the answer to the above question is no--
\Cref{tab:lbs} shows our results.
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(\query,\pdb, \bound) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of $\query$ (our notion of join width
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(\query,\pdb, \bound) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of $\query$ (our notion of join width
%follows from~\Cref{def:degree-of-poly}
is essentially the degree of the corresponding polynomial)
is essentially the degree of the corresponding polynomial we introduce in \Cref{sec:intro-poly-equiv}).
%% OK: Fig 1 hasn't been introduced yet
% defined in~\Cref{fig:nxDBSemantics}.)
of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
%of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
%
What our lower bound in the third row says, is that one cannot get more than a polynomial improvement (for fixed $k$) over essentially the trivial algorithm for~\Cref{prob:expect-mult}, assuming the Exponential Time Hypothesis (ETH)~\cite{eth}.
What our lower bound in the third row says, is that for a specific family of hard queries, one cannot get more than a polynomial improvement (for fixed $k$) over essentially the trivial algorithm for~\Cref{prob:expect-mult}, assuming the Exponential Time Hypothesis (ETH)~\cite{eth}.
%However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions).
Further, we note that existing results\footnote{This claim follows when we set $\query$ to the query that counts the number of $k$-cliques over database $\tupset$. Precisely the same bounds as in the three rows of~ \Cref{tab:lbs} (with $n$ replacing $\qruntime{\optquery{\query}, \tupset, \bound}$) follow from the same complexity assumptions we make: triangle detection hypothesis (by definition), $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181} and Strong ETH~\cite{CHEN20061346}. For the last result we can replace $k/\log{k}$ by just $k$.
We also note that existing results\footnote{This claim follows when we set $\query$ to the query that counts the number of $k$-cliques over database $\tupset$. Precisely the same bounds as in the three rows of~ \Cref{tab:lbs} (with $n$ replacing $\qruntime{\optquery{\query}, \tupset, \bound}$) follow from the same complexity assumptions we make: triangle detection hypothesis (by definition), $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181} and Strong ETH~\cite{CHEN20061346}. For the last result we can replace $k/\log{k}$ by just $k$.
%This claim follows from known results for the problem of evaluating a query $\query$ that counts the number of $k$-cliques over database $\tupset$. Specifically, a lower bound of the form $\Omega\inparen{n^{1+\eps_0}}$ for {\em some} $\eps_0>0$ follows from the triangle detection hypothesis (this like our result is for $k=3$). Second, a lower bound of $\omega\inparen{n^{C_0}}$ for {\em all} $C_0>0$ under the assumption $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181}. Finally, a lower bound of $\Omega\inparen{n^{c_0\cdot k}}$ for {\em some} $c_0>0$ was shown by~\cite{CHEN20061346} (under the strong exponential time hypothesis).
}
already imply the claimed lower bounds if we replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar = |\tupset|$.
imply the claimed lower bounds if we replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar = |\tupset|$.
%(indeed these results follow from known lower bounds for deterministic query processing).
Our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
@ -148,15 +151,15 @@ A common encoding of probabilistic databases (e.g., in \cite{IL84a,4497507,DBLP:
Given an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$ and result tuple $\tup$,
%compute the expected multiplicity of the polynomial $\poly$ (i.e.,
%for $\worldvec\in\worlds$,
compute $\expct_{\vct{W}\sim \pdassign}\pbox{\poly\inparen[\query,\tupset,\tup]{\worldvec}}$).
compute $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt\inparen{\worldvec}}$).
\end{Problem}
%We note that computing \Cref{prob:expect-mult} is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
We drop $\query$, $\tupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion.
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
Further, only light extensions (see \Cref{def:reduced-poly-one-bidb}) are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
All of our results rely on working with a {\em reduced} form $\rpoly$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
Further, only light extensions (see \Cref{def:reduced-poly-one-bidb}) are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10}. % (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
Next, we motivate this reduced polynomial $\rpoly$.
Next, we motivate the reduced polynomial $\rpoly$.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}[frame=single,framerule=0pt]
@ -167,9 +170,9 @@ WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
It can be verified that $\poly\inparen{A, B, C, E, U, Y, Z}$ for the sole result tuple of $\query_1$ is $AUB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is $\poly_1^2\inparen{A, B, C, E, U, Y, Z}$
$$
=A^2U^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AUB^2YE + 2AXB^2ZC + 2B^2YEZC.
=A^2U^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AUB^2YE + 2AUB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABU}^2} = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_U$ be the random variable corresponding to a lineage variable $U$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $W_U\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_U, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\monomial{1}(A,B,U) = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_U$ be the random variable corresponding to a lineage variable $U$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $W_U\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rmonomial{1}\inparen{\prob_A, \prob_U, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
The simple insight to get around this issue to note that the random variables $\randWorld_U$ and $\randWorld_{U_1}+2\randWorld_{U_2}$ have exactly the same distribution, where $\randWorld_{U_1},\randWorld_{U_2}\in\inset{0,1}$ and $\probOf\pbox{\randWorld_{U_j} = 1} = \probOf\pbox{\randWorld_{U} = j}$. Thus, the idea is to replace the variable $U$ by $U_1+2U_2$ (where $U_j$ corresponds to the event that $U$ has multiplicity $j$) yielding% to obtain the following polynomial:
%
@ -180,7 +183,7 @@ The simple insight to get around this issue to note that the random variables $\
%\begin{multline*}
%\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
%= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\[\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2 B} = \poly_1^{\inparen{AUB}^2}\inparen{A,(U_1+2U_2),B}.\]
\[\monomial{1,R}\inparen{A, U_1, U_2, B} = \monomial{1}\inparen{A,(U_1+2U_2),B}.\]
%\end{multline*}
%}
%Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$, computing expectation yields $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2$ \allowbreak $\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$.
@ -188,7 +191,7 @@ The simple insight to get around this issue to note that the random variables $\
Given that $U$ can only have multiplicity of $1$ or $2$ but not both,
%we drop the monomials with the term $U_1U_2$ to get
%$\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2, B} = A^2U_1^2B^2+2^2\cdot A^2 U_2^2B^2.$
given world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_A)$, we have $\expct\pbox{\randWorld_{U_1}\randWorld_{U_2}}=0$. Further, since the world vectors are Binary vectors, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to general $\poly$ leads to consider its following `reduced' version:
given world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_A)$, we have $\expct\pbox{\randWorld_{U_1}\randWorld_{U_2}}=0$. Further, since the world vectors are Binary vectors, we have $\expct\pbox{\monomial{1,R}}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rmonomial{1}\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to arbitrary $\poly$ leads to consider its following `reduced' version:
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
@ -231,15 +234,16 @@ We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymb
\mypar{Upper Bound Techniques}
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) can not achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplities.
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural followup is whether we can do better if we are willing to settle for an approximation to the expected multiplities.
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial $\poly(\vct{X})%=\textcolor{red}{CHANGE}\apolyqdt\inparen{\vct{X}}$
$;
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute an $(1\pm \eps)$-approximation $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (a representation of $\poly$ as an arithmetic circuit --- more on this representation in~\Cref{sec:expression-trees}).
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, which we can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective:
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo (when $\poly$ is input as $\circuit$)
%which we can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective:
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
@ -255,7 +259,7 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable/numeric source gates and multiplication/addition internal/sink gates.
}
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a circuit $\circuit$ with $\numvar$ sinks, one per output tuple).% representing the tuple's lineage).
%
Since a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ and
the size of $\circuit^*$ is bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$) (see~\Cref{sec:circuit-runtime}), we can focus on the complexity of \abbrStepTwo.
%Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:

View File

@ -255,6 +255,7 @@
%Polynomial
\newcommand{\hideg}{K}
\newcommand{\poly}{\Phi}
\newcommand{\monomial}[1]{M_{{#1}}}
\newcommand{\genpoly}{\phi}
\newcommand{\vars}[1]{\func{Vars}\inparen{#1}}
\newcommand{\polyOf}[1]{\poly[#1]}
@ -265,6 +266,7 @@
\newcommand{\atupvar}{\tupvar{\rel}{\tup}}
\newcommand{\polyX}{\poly\inparen{\vct{\pVar}}}%<---let's see if this proves handy
\newcommand{\rpoly}{\widetilde{\poly}}%r for reduced as in reduced 'Q'
\newcommand{\rmonomial}[1]{\widetilde{\monomial{#1}}}
\newcommand{\refpoly}[1]{\poly_{#1R}}
\newcommand{\rpolyX}{\rpoly\inparen{\pVar}}%<---if this isn't something we use much, we can get rid of it
\newcommand{\biDisProd}{\mathcal{B}}%bidb disjoint tuple products (def 2.5)