More changes @atri, @okennedy, @lordpretzel 021121 comments.

master
Aaron Huber 2022-02-14 12:28:41 -05:00
parent 47fbb0da36
commit 360920a8ec
5 changed files with 72 additions and 55 deletions

View File

@ -41,17 +41,17 @@ For these algorithms, $\jointime{R_1, \ldots, R_n}$ is linear in the {\em AGM bo
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntimenoopt{R,\db} & = |\db.R| &
\qruntimenoopt{\sigma \query, \db} & = \qruntimenoopt{\query,\db} &
\qruntimenoopt{\pi \query, \db} & = \qruntimenoopt{\query,\db} + \abs{\query(\db)}
\qruntimenoopt{R,\db,\bound} & = |\db.R| &
\qruntimenoopt{\sigma \query, \db,\bound} & = \qruntimenoopt{\query,\db} &
\qruntimenoopt{\pi \query, \db,\bound} & = \qruntimenoopt{\query,\db,\bound} + \abs{\query(\db)}
\end{align*}\\[-15mm]
\begin{align*}
\qruntimenoopt{\query \cup \query', \db} & = \qruntimenoopt{\query, \db} +
\qruntimenoopt{\query', \db} +
\qruntimenoopt{\query \cup \query', \db,\bound} & = \qruntimenoopt{\query, \db,\bound} +
\qruntimenoopt{\query', \db,\bound} +
\abs{\query(D)}+\abs{\query'(D)} \\
\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \db}
& = \qruntimenoopt{\query_1, \db} + \ldots +
\qruntimenoopt{\query_m,\db} +
\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \db,\bound}
& = \qruntimenoopt{\query_1, \db,\bound} + \ldots +
\qruntimenoopt{\query_m,\db,\bound} +
\jointime{\query_1(\db), \ldots, \query_m(\db)}
\end{align*}
\end{minipage}
@ -63,7 +63,7 @@ We assume that full table scans are used for every base relation access. We can
%Observe that
% () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known .
Finally, \Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\dbbase$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{Q, \dbbase})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}.
Finally, \Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{Q, \tupset,\bound})$. Recall we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}

View File

@ -4,8 +4,8 @@
\secrev{
This work explores the problem of computing the expectation of a tuple's multiplicity in an important special case of bag \abbrTIDB, which we call a \abbrCTIDB. A \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each %distribution
$\tup$ is an independent random event. %for $\tup \in \tupset$.
$\pdb = \inparen{\worlds, \bpd}$ encodes a bag of uncertain tuples such that each tuple in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity. $\bpd$ is a product distribution over the set of all worlds. A given world $\worldvec \in\worlds$ can be interpreted such that, for each $\tup \in \tupset$, $\worldvec\pbox{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The resulting product distribution can then be encoded as $\prob_{\tup} = \probOf\pbox{W\pbox{\tup} = j}$ (for $j \in\pbox{\bound}$), where each tuple multiplicity combination $\inparen{\inparen{\tup, \bound} \in \tupset\times\pbox{\bound}}$ %distribution
is an independent random event. %for $\tup \in \tupset$.
}
%\mypar{For a later section}
%\sout{
@ -18,7 +18,11 @@ In this work, since we are generally considering bag query input, we will only b
We can formally state our problem of computing the expected multiplicity of a result tuple as:
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$
\footnote{
A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more of the positive relational operators $\inset{\select, \project, \join, \union}$.
}
, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -99,13 +103,13 @@ Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$, a
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
It is natural to explore computing the expected multiplicity of result tuple as this is the analog for computing the marginal probability of a tuple in a set \abbrPDB.
It is natural to explore computing the expected multiplicity of a result tuple as this is the analog for computing the marginal probability of a tuple in a set \abbrPDB.
In this work we will assume that $c =\bigO{1}$ since this is what typically seen in practice.
%because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations.
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suicu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any query over a \abbrCTIDB due to linearity of expection by siimply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$. %We discuss polynomial representation and equivalence in the following subsection.
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suicu to be \sharpphard\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB due to linearity of expection by simply computing the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$. %We discuss polynomial representation and equivalence in the following subsection.
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore deals with analyzing the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
}
@ -122,7 +126,10 @@ Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in ti
%Define $\gentupset$ to be the set of tuples appearing across all the possible worlds of a $\abbrCTIDB$, formally $\gentupset = \inset{\tup_i ~|~ \forall \worldvec \in \worlds,~\forall i \in \abs{\tupset}:~\worldvec\pbox{i} > 0}$. When a specific $\pdb = \inparen{\worlds, \bpd}$ is being referred to, we will use $\tupset$ to denote the set of tuples.
%\end{Definition}
Let $T_{det}\inparen{\query, \gentupset, \bound} = \query\inparen{\gentupset}$ for arbitrary query $\query$, deterministic database $\gentupset$, and multiplicity bound $c$. Let $\qruntime{\query, \gentupset, \bound} = \min_{\query':\query'\equiv\query}T_{det}\inparen{\query, \gentupset, \bound}$ be the optimal runtime (with some caveats; discussed in~\Cref{sec:gen}) of query $\query$ on deterministic database $\tupset$.
Let $\qruntime{\optquery{\query},\gentupset,\bound}$ denote the runtime for query $\optquery{\query}$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Being we consider $\raPlus$ queries in which order of operators can impact runtime, we denote the optimal query as $\optquery{\query} = \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$.
%let $\qruntim{\optquery{\query}, \gentupset, \bound} = \min_{\query'\in\raPlus,~\query'\equiv\query}T_{det}\inparen{\query, \gentupset, \bound}$ be the runtime for the optimally structured equivalent $\raPlus$ query $\query'$ (with some caveats; discussed in~\Cref{sec:gen}). % of query $\query$ on deterministic database $\tupset$.
%{\newline\noindent\centerline{\Huge \textcolor{black}{Or instead$\ldots$}}}
%\newline\noindent Let $T_{det}\inparen{\query, \gentupset, \bound}$ denote the runtime for $\raPlus$ query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. Since this paper does not consider optimization schemes, we leave optimization to the reader and show that our results hold across all inputs.
%We make this runtime concrete later on.
%We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
@ -135,33 +142,33 @@ Let $T_{det}\inparen{\query, \gentupset, \bound} = \query\inparen{\gentupset}$ f
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\bpd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
$\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$ (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$.For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results}
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\query, \tupset, \bound}$. Unfortunately this is not the case.
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$. Unfortunately this is not the case.
~\Cref{tab:lbs} shows our results.%our lower bounds for computing~\Cref{prob:expect-mult} on \abbrCTIDB\xplural.
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
Specifically, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bound for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce an $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\query, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntime{\query,\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
\mypar{Our upper bound results} We introduce an $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
and bag \abbrPDB\xplural are deployable in practice.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
%Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\query, \tupset, \bound}^{2k})$ %, where $\circuit$ is a representation of the query operations and input to produce $\tup$; more on this shortly.
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$ %, where $\circuit$ is a representation of the query operations and input to produce $\tup$; more on this shortly.
(see \Cref{sec:karp-luby}).
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
%we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases
@ -203,12 +210,12 @@ multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign
%where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
\end{Problem}
We note that computing \Cref{prob:expect-mult}
is equivalent to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
}
\secrev{
All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistice database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
All of our results rely on working with a {\em reduced} form $\inparen{\poly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistice database (with tuple multiplicity of at most $1$), which we refer to as a $1$-\abbrBIDB.
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
%\begin{Definition}
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
@ -229,15 +236,15 @@ The lineage polynomial for $Q_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X,
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. However, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $X\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_X, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}}$ that is of use to us. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}X_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Considering again our example,
\begin{multline*}
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{AXB}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\end{multline*}
Since the set of multiplicities for tuple $\tup$ by nature are disjoint we can drop all cross terms and have $\refpoly{1, }^2 = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing expectation we get $\expct\pbox{\refpoly{1, }^2}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since we now have that all $\randWorld_{X_j}\in\inset{0, 1}$.
@ -306,7 +313,7 @@ $, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in
\secrev{
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $O(1)$ tuples in \Cref{fig:two-step}).
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
\mypar{Upper Bound Techniques}
@ -315,20 +322,20 @@ Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bo
\input{two-step-model}
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct\pbox{\poly(\vct{\randWorld})}$.
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation shortly).
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, which we can leverage~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly} to address the next formal objective: % to formally define our objective:
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{Q, \tupset, \bound})$?
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{\optquery{\query}, \tupset, \bound})$?
\end{Problem}
We show in \Cref{sec:circuit-depth} an $O(\qruntime{Q, \tupset, \bound})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
We show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\query, \tupset, \bound}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ will be too large.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
@ -337,9 +344,9 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
}
as the representation system of $\poly(\vct{X})$.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le O(\qruntime{\query, \tupset, \bound})$, we can now focus on the complexity of \abbrStepTwo.
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of \abbrStepTwo.
We can represent the factorized lineage polynomial by its correspoding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\query, \tupset, \bound}$ (i.e., $|\circuit^*| \le O(\qruntime{\query, \tupset, \bound})$).
As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
\begin{Problem}\label{prob:intro-stmt}

View File

@ -141,7 +141,7 @@
\newcommand{\tupset}{D}
\newcommand{\gentupset}{\overline{D}}
\newcommand{\world}{\inset{0,\ldots, c}}
\newcommand{\worldvec}{\vct{M}}
\newcommand{\worldvec}{\vct{W}}
\newcommand{\worlds}{\world^\tupset}
\newcommand{\bpd}{\mathcal{P}}%bpd for bag probability distribution
%BIDB
@ -211,6 +211,7 @@
%Instance Variables
\newcommand{\prob}{p}
\newcommand{\wElem}{w} %an element of \vct{w}
\newcommand{\worldinst}{W}
%Polynomial Variables
\newcommand{\pVar}{X}%<----not used but recomment instituting this--pVar for polyVar
\newcommand{\kElem}{k}%the kth element<---where and how are we using this?
@ -333,8 +334,9 @@
\newcommand{\sharpwonehard}{\#{\sf W}[1]-hard\xspace}
\newcommand{\ptime}{{\sf PTIME}\xspace}
\newcommand{\timeOf}[1]{T_{#1}}
\newcommand{\qruntime}[1]{T^*_{det}(#1)}
\newcommand{\qruntimenoopt}[1]{T_{det}\inparen{#1}}
\newcommand{\qruntime}[1]{T_{det}\inparen{#1}}
\newcommand{\optquery}[1]{\func{OPT}\inparen{#1}}
\newcommand{\qruntimenoopt}[1]{T_{det}\inparen{#1}}%need to get rid of this--needs to be propagated
\newcommand{\jointime}[1]{T_{join}(#1)}
\newcommand{\kmatchtime}{T_{match}\inparen{k, G}}

View File

@ -3,7 +3,7 @@
\section{Hardness of Exact Computation}
\label{sec:hard}
In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific (family) of hard instance $(\query,\pdb)$ for \Cref{prob:bag-pdb-poly-expected} where $\pdb$ is a $1$-\abbrTIDB.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, \bis and general \abbrBPDB, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\query,\tupset}}$ runtime.
Note that this implies hardness for \abbrCTIDB\xplural $\inparen{\bound\geq1}$, \bis and general \abbrBPDB, showing \Cref{prob:bag-pdb-poly-expected} cannot be done in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime.
%(and hence the equivalent \Cref{prob:bag-pdb-query-eval})
%in the negative.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -55,24 +55,24 @@ as $R_i$ for each $i \in [k]$. The query $\query^k$ then becomes
\begin{lstlisting}
SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
\end{lstlisting}
\noindent Further, the \abbrCTIDB instance of~\Cref{fig:two-step} generalizes to one compatible to~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob_i$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
\noindent Consider again the \abbrCTIDB instance $\pdb$ of~\Cref{fig:two-step} and, for our hard instance, let $\bound = 1$. $\pdb$ generalizes to one compatible to~\Cref{def:qk} as follows. Relation $T$ has $n$ tuples corresponding to each vertex for $i$ in $[n]$, each with probability $\prob_i$ and $R$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $R$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $R$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
In other words, for this instance $\tupset$ contains the set of $\numvar$ unary tuples in $T$ (which corresponds to $\vset$) and $\numedge$ binary tuples in $R$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a \abbrCTIDB-lineage polynomial. % for a \abbrTIDB \abbrPDB.
\AH{Can the proofs generalize to $2$-\abbrTIDB, as the new updated~\Cref{fig:two-step} now is?}
Next, we note that the runtime for answering $\query^k$ on deterministic database $\tupset$, as defined above, is $\bigO{\numedge}$ (i.e. deterministic query processing is `easy' for this query):
\begin{Lemma}\label{lem:tdet-om}
Let $\query^k$ and $\tupset$ be as defined above. Then
% of \Cref{def:qk}, the runtime
$\qruntimenoopt{\query^k, \tupset}$ is $\bigO{\kElem\numedge}$.
\end{Lemma}
\AH{Should the above be $\qruntimenoopt{}$ or $\qruntime$?}
\AH{Should the above be $\qruntimenoopt{}$ or $\qruntime{}$?}
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
We are now ready to present our main hardness result.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\AH{Note that~\Cref{def:reduced-poly} has been changed, where we compute $\rpoly$ from $\refpoly{}$.}
\begin{Theorem}\label{thm:mult-p-hard-result}
Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$. Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ (over all $i\in [2k+1]$ for arbitrary $G=(\vset,\edgeSet)$
%and any $(2k+1)$ distinct values $\prob_i$ ($0\le i \le 2k$)

View File

@ -7,18 +7,19 @@
%We now introduce some terminology
%and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
%Note that
\secrev{A }
polynomial over $\vct{X}=(X_1,\dots,X_n)$ with individual degree $B <\infty$
\secrev{
A polynomial over a set of variables $\vct{S}$ with $\abs{S}=\numedge$ and individual degree $B <\infty$
is formally defined as (where $c_{\vct{d}}\in \semN$):
\begin{equation}
\label{eq:sop-form}
\poly\inparen{X_1,\dots,X_n}=\secrev{\sum_{\vct{d}\in\{0,\ldots,B\}^\tupset} c_{\vct{d}}\cdot \prod_{\tup\in\tupset} X_\tup^{d_\tup}.}
\poly\inparen{S_1,\dots,S_\numedge}=\sum_{\vct{d}\in\{0,\ldots,B\}^\tupset} c_{\vct{d}}\cdot \prod_{i\in\pbox{\numedge}}S_i^{d_i}.
\end{equation}
}
%where $c_{\vct{d}}\in \semN$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Standard Monomial Basis]\label{def:smb}
The term $\prod_{\tup\in\tupset} X_\tup^{d_\tup}$ in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
\secrev{The term $\prod_{i\in\pbox{\numedge}} S_i^{d_i}$ }in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Unless othewise noted, we consider all polynomials to be in \abbrSMB representation.
@ -26,8 +27,9 @@ When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a poly
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Degree]\label{def:degree-of-poly}
The degree of polynomial $\poly(\vct{X})$ is the largest \secrev{$\norm{\vct{d}}_1$}% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
The degree of polynomial $\poly(\vct{X})$ is the largest \secrev{$\vct{d} = \sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
$}% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
@ -41,13 +43,19 @@ or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \abbr
%Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
\subsubsection{\abbrCTIDB\xplural and \abbrOneBIDB\xplural}
\subsection{\abbrCTIDB\xplural and \abbrOneBIDB\xplural}
\label{subsec:tidbs-and-bidbs}
An \textit{incomplete database} $\Omega$ is a set of deterministic databases $\omega$ called possible worlds.
%An \textit{incomplete database} $\Omega$ is a set of deterministic databases $\worldvec$ called possible worlds.
\noindent\secrev{
A \abbrCTIDB $\pdb$ is a pair $\inparen{\worlds, \bpd}$ such that $\worlds$ is an incomplete database whose set of possible worlds is the $c+1^\numvar$ tuple/multiplicity combinations across all $\tup\in\tupset$, where $\abs{\tupset} = \numvar$, $\tupset = \bigcup_{\worldvec\in\worlds,~\worldvec_{\tup}\geq 1}\tup$ is the set of possible tuples across possible worlds, and $\bpd$ is a probability distribution over $\worlds$.
A block independent database \abbrBIDB $\pdb'$ can viewed as a $1$-\abbrTIDB $\pdb$ with the added flexibility that each $\tup\in\tupset$ has multiple disjoint alternatives, i.e., all $\tup \in \tupset'$ are partitioned into $m$ independent blocks with the condition that tuples $\tup \in \block_i$ for $i \in \pbox{m}$ are disjoint events. We define next a specific construction of \abbrBIDB that is useful for out work.
\begin{Definition}[$1$-\abbrBIDB]\label{def:one-bidb}
Define a $1$-\abbrBIDB to be the pair $\pdb' = \inparen{\prod_{\tup\in\tupset'}\inset{0, \bound_\tup}, \bpd'},$ where $\tupset'$ is the set of possible tuples such that each $\tup \in \tupset'$ has a multiplicity domain of $\inset{0, \bound_\tup}$, with $\bound_\tup \in \mathbb{N}$. The term $\prod_{\tup\in\tupset'}$ is the direct product of all such multiplicity domain pairs. The tuples $\tup\in\tupset'$ are further partitioned into $m$ independent blocks $\block_i,~i\in\pbox{m}$ of disjoint tuples. $\bpd$ is the probability distribution across all worlds such that, given $\worldvec\in\prod_{\tup\in\tupset'}\inset{0,\bound_\tup},\tup,~\tup'\in\block_i~:~\probOf\pbox{\worldvec_\tup, \worldvec_\tup'>0} = 0$.
\end{Definition}
%A \abbrCTIDB $\pdb$ is a pair $\inparen{\worlds, \bpd}$ such that $\worlds$ is an incomplete database whose set of possible worlds is the $c+1^\numvar$ tuple/multiplicity combinations across all $\tup\in\tupset$, where $\abs{\tupset} = \numvar$, $\tupset = \bigcup_{\worldvec\in\worlds,~\worldvec_{\tup}\geq 1}\tup$ is the set of possible tuples across possible worlds, and $\bpd$ is a probability distribution over $\worlds$.
\begin{Definition}[$\bound$-Block Independent Disjoint Database ($\bound$-\abbrBIDB)]\label{def:bidb}
A $\bound$-block independent database ($\bound$-\abbrBIDB) $\pdb' = \inparen{\inset{0,\ldots,\bound}^{\tupset'}, \bpd'}$ is a probabilistic database such that the all worlds set is encoded as the set of vectors $\worldvec\in\inset{0,\ldots,\bound}^{\abs{\tupset'}}$ where $\worldvec_\tup\leq\bound$ is the multiplicity for tuple $\tup$. $\pdb'$ requires the set of all possible tuples $\tupset = \bigcup_{\worldvec\in\inset{0,\ldots, \bound}^{\tupset'},~\worldvec_\tup \geq 1}\tup$ to be partitioned into $m$ independent blocks $\block_i$ ($i\in\pbox{m}$) where all tuples $\tup_{i, j}\in \block_i$ are disjoint. $\bpd'$ is the probability distribution where, for all $\worldvec\in\inset{0,\ldots,\bound}^{\tupset'}$ such that $\worldvec_{\tup_{i, j}},\worldvec_{\tup_{i, j'}}\neq 0, j\neq j'$ for any $\block_i$, $\probOf\pbox{\worldvec} = 0$, where all other $\worldvec$ has $0<\probOf\pbox{\worldvec}\leq 1$.%bpd'$ set with the all worlds set $\worlds$ and probability distribution $\bpd'$ such that $\tupset' = \bigcup_{\worldvec\in\worlds, \worldvec_\tup \geq 1}\tup$ is the set of all possible tuples for which all $\tup\in\tupset'$ can be partitioned into $\numedge$ blocks $\block_i$ where the set of tuples $\tup_j \in \block_i$ are all disjoint, while blocks $\block_i$ are independent of one another. Each $\tup\in\tupset'$ has a multiplicity of at most $\bound$. $\bpd'$ is the distribution such that for any $\worldvec\in\worlds$ with $\worldvec_{\tup_{i, j}}\geq 1$ and $\worldvec_{\tup_{i, j'}}\geq 1$, $j\neq j'$ in any $\block_i$ more than one tuple present from the same block $\block_i$ has probability $\probOf\pbox{\worldvec} = 0$.
\end{Definition}
A block independent database (\abbrBIDB) is a related probabilistic data model $\pdb=\inparen{\Omega, \bpd}$ such that the base set of tuples $\tupset = \bigcup_{\omega\in\Omega,~\tup\in\omega}\tup$ is partitioned into a set of $\numvar$ independent blocks $\inset{\inparen{\block_\tup}_{\tup\in\pbox{\numvar}}}$ such that the set of tuples $\inset{\inparen{\tup_j}_{j\in\pbox{\abs{\block}}}}$ in block $\block_\tup$ are disjoint from one another. This construction produces the set of possible worlds $\Omega$ that consists of all unique combinations of tuples in $\tupset$ with the constraint that for any $\omega\in\Omega$, no two tuples $\tup_j, \tup_{j'}, j\neq j'$ from the same block $\block_\tup$ exist together. A $\bound$-\abbrBIDB has the further requirement that each block has a multiplicity of at most $c$. We present a reduction that is useful in producing our results:
\begin{Definition}[\abbrCTIDB reduction]\label{def:ctidb-reduct}