|
|
|
@ -4,8 +4,8 @@
|
|
|
|
|
|
|
|
|
|
\secrev{
|
|
|
|
|
This work explores the problem of computing the expectation of a tuple's multiplicity in an important special case of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB,
|
|
|
|
|
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each %distribution
|
|
|
|
|
$\tup$ is an independent random event. %for $\tup \in \tupset$.
|
|
|
|
|
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each tuple multiplicity combination $\inparen{\inparen{\tup, \bound} \in \tupset\times\pbox{\bound}}$ %distribution
|
|
|
|
|
is an independent random event. %for $\tup \in \tupset$.
|
|
|
|
|
}
|
|
|
|
|
%\mypar{For a later section}
|
|
|
|
|
%\sout{
|
|
|
|
@ -18,7 +18,11 @@ In this work, since we are generally considering bag query input, we will only b
|
|
|
|
|
We can formally state our problem of computing the expected multiplicity of a result tuple as:
|
|
|
|
|
|
|
|
|
|
\begin{Problem}\label{prob:expect-mult}
|
|
|
|
|
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
|
|
|
|
|
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$
|
|
|
|
|
\footnote{
|
|
|
|
|
A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more of the positive relational operators $\inset{\select, \project, \join, \union}$.
|
|
|
|
|
}
|
|
|
|
|
, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
|
|
|
|
|
\end{Problem}
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
@ -99,13 +103,13 @@ Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, a
|
|
|
|
|
% \label{fig:ctidb-red}
|
|
|
|
|
%\end{figure}
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
It is natural to explore computing the expected multiplicity of result tuple as this is the analog for computing the marginal probability of a tuple in a set \abbrPDB.
|
|
|
|
|
It is natural to explore computing the expected multiplicity of a result tuple as this is the analog for computing the marginal probability of a tuple in a set \abbrPDB.
|
|
|
|
|
In this work we will assume that $c =\bigO{1}$ since this is what typically seen in practice.
|
|
|
|
|
%because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations.
|
|
|
|
|
Allowing for unbounded $c$ is an interesting open problem.
|
|
|
|
|
|
|
|
|
|
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
|
|
|
|
|
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suicu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any query over a \abbrCTIDB due to linearity of expection by siimply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$. %We discuss polynomial representation and equivalence in the following subsection.
|
|
|
|
|
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suicu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB due to linearity of expection by simply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$. %We discuss polynomial representation and equivalence in the following subsection.
|
|
|
|
|
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore deals with analyzing the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
@ -122,7 +126,10 @@ Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in ti
|
|
|
|
|
%Define $\gentupset$ to be the set of tuples appearing across all the possible worlds of a $\abbrCTIDB$, formally $\gentupset = \inset{\tup_i ~|~ \forall \worldvec \in \worlds,~\forall i \in \abs{\tupset}:~\worldvec\pbox{i} > 0}$. When a specific $\pdb = \inparen{\worlds, \bpd}$ is being referred to, we will use $\tupset$ to denote the set of tuples.
|
|
|
|
|
%\end{Definition}
|
|
|
|
|
|
|
|
|
|
Let $T_{det}\inparen{\query, \gentupset, \bound} = \query\inparen{\gentupset}$ for arbitrary query $\query$, deterministic database $\gentupset$, and multiplicity bound $c$. Let $\qruntime{\query, \gentupset, \bound} = \min_{\query':\query'\equiv\query}T_{det}\inparen{\query, \gentupset, \bound}$ be the optimal runtime (with some caveats; discussed in~\Cref{sec:gen}) of query $\query$ on deterministic database $\tupset$.
|
|
|
|
|
Let $\qruntime{\optquery{\query},\gentupset,\bound}$ denote the runtime for query $\optquery{\query}$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Being we consider $\raPlus$ queries in which order of operators can impact runtime, we denote the optimal query as $\optquery{\query} = \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$.
|
|
|
|
|
%let $\qruntim{\optquery{\query}, \gentupset, \bound} = \min_{\query'\in\raPlus,~\query'\equiv\query}T_{det}\inparen{\query, \gentupset, \bound}$ be the runtime for the optimally structured equivalent $\raPlus$ query $\query'$ (with some caveats; discussed in~\Cref{sec:gen}). % of query $\query$ on deterministic database $\tupset$.
|
|
|
|
|
%{\newline\noindent\centerline{\Huge \textcolor{black}{Or instead$\ldots$}}}
|
|
|
|
|
%\newline\noindent Let $T_{det}\inparen{\query, \gentupset, \bound}$ denote the runtime for $\raPlus$ query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Since this paper does not consider optimization schemes, we leave optimization to the reader and show that our results hold across all inputs.
|
|
|
|
|
|
|
|
|
|
%We make this runtime concrete later on.
|
|
|
|
|
%We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
|
|
|
|
@ -135,33 +142,33 @@ Let $T_{det}\inparen{\query, \gentupset, \bound} = \query\inparen{\gentupset}$ f
|
|
|
|
|
\hline
|
|
|
|
|
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\bpd$s & Hardness Assumption\\
|
|
|
|
|
\hline
|
|
|
|
|
$\Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
|
|
|
|
|
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
|
|
|
|
|
%\hline
|
|
|
|
|
$\omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
|
|
|
|
|
$\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
|
|
|
|
|
%\hline
|
|
|
|
|
$\Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
|
|
|
|
|
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
|
|
|
|
|
\hline
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$ (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
|
|
|
|
|
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$.For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
|
|
|
|
|
\label{tab:lbs}
|
|
|
|
|
\end{table}
|
|
|
|
|
\mypar{Our lower bound results}
|
|
|
|
|
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\query, \tupset, \bound}$. Unfortunately this is not the case.
|
|
|
|
|
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$. Unfortunately this is not the case.
|
|
|
|
|
~\Cref{tab:lbs} shows our results.%our lower bounds for computing~\Cref{prob:expect-mult} on \abbrCTIDB\xplural.
|
|
|
|
|
|
|
|
|
|
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
|
|
|
|
|
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
|
|
|
|
|
|
|
|
|
|
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
|
|
|
|
|
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
|
|
|
|
|
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
|
|
|
|
|
|
|
|
|
|
\mypar{Our upper bound results} We introduce an $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\query, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntime{\query,\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
|
|
|
|
|
\mypar{Our upper bound results} We introduce an $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
|
|
|
|
|
and bag \abbrPDB\xplural are deployable in practice.
|
|
|
|
|
% In particular, we show the following upper bound results.
|
|
|
|
|
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
|
|
|
|
|
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
|
|
|
|
|
%Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
|
|
|
|
|
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
|
|
|
|
|
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\query, \tupset, \bound}^{2k})$ %, where $\circuit$ is a representation of the query operations and input to produce $\tup$; more on this shortly.
|
|
|
|
|
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$ %, where $\circuit$ is a representation of the query operations and input to produce $\tup$; more on this shortly.
|
|
|
|
|
(see \Cref{sec:karp-luby}).
|
|
|
|
|
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
|
|
|
|
|
%we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases
|
|
|
|
@ -203,12 +210,12 @@ multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign
|
|
|
|
|
%where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
|
|
|
|
|
\end{Problem}
|
|
|
|
|
We note that computing \Cref{prob:expect-mult}
|
|
|
|
|
is equivalent to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
|
|
|
|
|
is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
|
|
|
|
|
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
\secrev{
|
|
|
|
|
All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistice database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
|
|
|
|
|
All of our results rely on working with a {\em reduced} form $\inparen{\poly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistice database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
|
|
|
|
|
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
|
|
|
|
|
%\begin{Definition}
|
|
|
|
|
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
|
|
|
|
@ -229,15 +236,15 @@ The lineage polynomial for $Q_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X,
|
|
|
|
|
$$
|
|
|
|
|
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
|
|
|
|
|
$$
|
|
|
|
|
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. However, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
|
|
|
|
|
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $X\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_X, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
|
|
|
|
|
|
|
|
|
|
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
|
|
|
|
|
|
|
|
|
|
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
|
|
|
|
|
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
|
|
|
|
|
|
|
|
|
|
Considering again our example,
|
|
|
|
|
\begin{multline*}
|
|
|
|
|
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
|
|
|
|
|
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
|
|
|
|
|
= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
|
|
|
|
|
\end{multline*}
|
|
|
|
|
Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing expectation we get $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$.
|
|
|
|
@ -306,7 +313,7 @@ $, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in
|
|
|
|
|
\secrev{
|
|
|
|
|
\subsection{Our Techniques}
|
|
|
|
|
\mypar{Lower Bound Proof Techniques}
|
|
|
|
|
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
|
|
|
|
|
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
|
|
|
|
|
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
|
|
|
|
|
|
|
|
|
|
\mypar{Upper Bound Techniques}
|
|
|
|
@ -315,20 +322,20 @@ Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bo
|
|
|
|
|
\input{two-step-model}
|
|
|
|
|
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
|
|
|
|
|
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
|
|
|
|
|
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
|
|
|
|
|
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
|
|
|
|
|
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
|
|
|
|
|
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, which we can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective: % to formally define our objective:
|
|
|
|
|
|
|
|
|
|
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
|
|
|
|
|
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
|
|
|
|
|
is there a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ for all result tuples $\tup$ where
|
|
|
|
|
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{Q, \tupset, \bound})$?
|
|
|
|
|
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{\optquery{\query}, \tupset, \bound})$?
|
|
|
|
|
\end{Problem}
|
|
|
|
|
|
|
|
|
|
We show in \Cref{sec:circuit-depth} an $O(\qruntime{Q, \tupset, \bound})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
|
|
|
|
|
We show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
|
|
|
|
|
A key insight of this paper is that the representation of $\circuit$ matters.
|
|
|
|
|
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^k}$,
|
|
|
|
|
and hence, just $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ will be too large.
|
|
|
|
|
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
|
|
|
|
|
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ will be too large.
|
|
|
|
|
|
|
|
|
|
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
|
|
|
|
|
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
|
|
|
|
@ -337,9 +344,9 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
|
|
|
|
|
}
|
|
|
|
|
as the representation system of $\poly(\vct{X})$.
|
|
|
|
|
|
|
|
|
|
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le O(\qruntime{\query, \tupset, \bound})$, we can now focus on the complexity of \abbrStepTwo.
|
|
|
|
|
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of \abbrStepTwo.
|
|
|
|
|
We can represent the factorized lineage polynomial by its correspoding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
|
|
|
|
|
As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\query, \tupset, \bound}$ (i.e., $|\circuit^*| \le O(\qruntime{\query, \tupset, \bound})$).
|
|
|
|
|
As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
|
|
|
|
|
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
|
|
|
|
|
can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
|
|
|
|
|
\begin{Problem}\label{prob:intro-stmt}
|
|
|
|
|