Finished incorporating @atri write up changes to S2, S4, S5

master
Aaron Huber 2021-09-15 12:18:44 -04:00
parent d3081e80d1
commit 17b1afb954
9 changed files with 41 additions and 40 deletions

View File

@ -184,6 +184,21 @@ The property holds for all recursive queries, and the proof holds.
With \cref{lem:circ-model-runtime} and our upper bound results on \approxq, we now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of an $\raPlus$ query can be computed in essentially the same runtime as deterministic query processing for the same query, proving claim (iv) of the Introduction.
\section{Proof of \Cref{cor:cost-model}}
\begin{proof}
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
\qed
\end{proof}
\mypar{Higher Moments}
%\label{sec:momemts}
%
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$.
In addition, we could e.g. prove bounds of the probability of a tuple's multiplicity being at least $1$.
Progress can be made on this as follows:
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
We leave further investigations for future work.
%%% Local Variables:
%%% mode: latex

View File

@ -3,10 +3,8 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In \Cref{sec:hard}, we showed that the answer to $\Cref{prob:intro-stmt}$ is no.
%computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or more general \abbrPDB models) %any $\semNX$-PDB)
%is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
With this result, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
@ -42,7 +40,7 @@ For any circuit $\circuit$, the corresponding
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
\end{Definition}
We will overload notation and use $\abs{\circuit}\inparen{\vct{X}}$ to mean $\polyf\inparen{\abs{\circuit}}$.
Conveniently, $\abs{\circuit}\inparen{1,\ldots,1}$ gives us the number of terms represented in $\expansion{\circuit}$, i.e. $\sum\limits_{\inparen{\monom, \coef} \in \expansion{\circuit}}\abs{\coef}$.
Conveniently, $\abs{\circuit}\inparen{1,\ldots,1}$ gives us $\sum\limits_{\inparen{\monom, \coef} \in \expansion{\circuit}}\abs{\coef}$.
\begin{Definition}[\size($\cdot$), \depth$\inparen{\cdot}$]\label{def:size-depth}
The functions \size and \depth output the number of gates and levels respectively for input \circuit.
@ -80,7 +78,7 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
\subsection{Our main result}
\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
\begin{Theorem}\label{lem:approx-alg}
Let \circuit be an arbitrary arithmetic circuit from a \abbrBIDB %for a UCQ over \bi
Let \circuit be an arbitrary circuit from a \abbrBIDB %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
{\small
@ -95,7 +93,7 @@ such that
To get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
$\expansion{\circuit}$
to be `canceled' monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
that need to be `canceled' when monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given a circuit $\circuit$ from a \abbrBIDB, define

View File

@ -1,6 +1,6 @@
%!TEX root=./main.tex
\section{More on Circuits and Moments}\label{sec:gen}
\subsection{More on Circuits and Moments}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
Lastly, we generalize our result for expectation to other moments.
@ -18,7 +18,7 @@ Lastly, we generalize our result for expectation to other moments.
\mypar{The cost model}
%\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pxdb$'s \dbbaseName $\dbbase$.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
@ -44,7 +44,7 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
Under this model a query $Q$ evaluated over database $\dbbase$ has runtime $O(\qruntime{Q,\dbbase})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures their runtime.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}
@ -56,29 +56,18 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
%
We are now ready to formally state our claim from \Cref{sec:intro}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Given an SPJU query $Q$ over a \ti $\pxdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
\begin{Corollary}\label{cor:cost-model}
Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
%
\[
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
\]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Higher Moments}
%\label{sec:momemts}
%
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$.
In addition, we could e.g. prove bounds of the probability of a tuple's multiplicity being at least $1$.
Progress can be made on this as follows:
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
We leave further investigations for future work.
%%% Local Variables:
%%% mode: latex

View File

@ -11,7 +11,7 @@ in SOP form, the problem is \sharpwonehard for factorized polynomials (proven th
We prove that it is possible to approximate the expectation of a lineage polynomial in linear time
% When only considering polynomials for result tuples of
for UCQs over TIDBs and BIDBs (assuming that there are few cancellations).
Interesting directions for future work include development of a dichotomy for bag PDBs and approximations for more general data models. % beyond what we consider in this paper.
Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we handle higher moments in \Cref{app:sec-cicuits}, more general approximations are an interesting area for exploration, including those for more general data models. % beyond what we consider in this paper.
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
% \BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:

View File

@ -132,7 +132,6 @@ sensitive=true
\input{mult_distinct_p}
\input{single_p}
\input{approx_alg}
\input{circuits-model-runtime}
\input{related-work}
\input{conclusions}

View File

@ -9,7 +9,6 @@ and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ with individual degree $B <\infty$
%\footnote{The standard definition of polynomials requires a finite number of terms.} and $c_\vct{i} \in \domN$
is formally defined as: %(with $c_\vct{i} \in \domN$):
\AH{Do we want to say that $\domain\inparen{c_\vct{i}} = \domR$ instead? I've only seen the $\semNX$ use case. Is there a legitimate use case for real valued coefficients?}
\begin{equation}
\label{eq:sop-form}
\poly\inparen{X_1,\dots,X_n}=\sum_{\vct{d}\in\{0,\ldots,B\}^n} c_{\vct{d}}\cdot \prod_{i=1}^n X_i^{d_i},
@ -32,8 +31,7 @@ The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{i=1}^n d_i$ such
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins to produce an output tuple.
%in any clause of the $\raPlus$ query that created it.
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and tuple $\tup$ such that $\poly\inparen{\vct{X}} = \query(\pdb)(\tup)$.
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}.$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Definition}[Modding with a set]\label{def:mod-set}
%Let $S$ be a {\em set} of polynomials over $\vct{X}$. Then $\poly(\vct{X})\mod{S}$ is the polynomial obtained by taking the mod of $\poly(\vct{X})$ over {\em all} polynomials in $S$ (order does not matter).
@ -164,7 +162,7 @@ By \Cref{lem:exp-poly-rpoly} and linearity of expectation, the following corolla
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\poly}}$, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\poly}}$, where $\size\inparen{\poly}$ (\Cref{def:size-depth}) is proportional to the total number of multiplication/addition operators in $\poly$.
\end{Corollary}

View File

@ -92,7 +92,7 @@ Note that each circuit \circuit encodes a tree, with edges pointing towards the
%\end{figure}
We next formally define the relationship of circuits with polynomials. While the definition assumes one sink for notational convenience, it easily generalizes to the multiple sinks case.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB).\footnote{Recall our assumption that unless otherwise mentioned, all polynomials are considered in $\abbrSMB$.} $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
\begin{equation*}
\polyf(\circuit) = \begin{cases}
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\
@ -115,7 +115,7 @@ The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\medskip
\noindent We are now ready to formally state our \textbf{main problem}.
\noindent We are now ready to formally state the final version of \Cref{prob:intro-stmt}.%our \textbf{main problem}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables annotating tuples in $\dbbase$. Fix a query $\query$ and an output tuple $\tup$.
@ -126,6 +126,8 @@ Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables a
\textbf{Output}: $\expct_{\vct{W} \sim \pdassign}[\apolyqdt(\vct{W})]$
\end{center}
\end{Definition}
\input{circuits-model-runtime}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -12,7 +12,7 @@ A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
%
\[\forall \db' \in \query(\idb): \pd'(\db') = \sum_{\db \in \idb: \query(\db) = \db'} \pd(\db). \]
%\[\forall \db' \in \query(\idb): \pd'(\db') = \sum_{\db \in \idb: \query(\db) = \db'} \pd(\db). \]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%NEEDS to be moved to the appendix.
@ -40,11 +40,11 @@ Recall \Cref{fig:nxDBSemantics} which depicts the semantics for constructing a l
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\idb,\pd)$ and lineage polynomial $\apolyqdt$ for aribitrary output tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
we have:
$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\randWorld\sim \pdassign}\pbox{\apolyqdt(\randWorld)}. $
$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\vct{\randWorld}\sim \pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}. $
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.\footnote{Although \Cref{prop:expection-of-polynom} follows, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
%This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\randWorld}}$a from now on, assume an implicit result tuple, and so drop the subscript from $\apolyqdt$ (i.e., $\poly$ will denote a polynomial).
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \dbbase, \tup$, and so drop the subscript from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
@ -52,12 +52,12 @@ In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independ
%
A \bi $\pdb$ is a \abbrPDB with the constraint that
%(i) every tuple $\tup_i$ is annotated with a unique random variable $\randWorld_i \in \{0, 1\}$ and (ii) that
the tuples can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
}
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\pdb$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_i \abs{b_i} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
%that $\forall i \in \abs{\block}, \forall j\neq k \in [\block_i] \suchthat \db\inparen{\tup_{i, j}} = 0 \vee \db\inparen{\tup_{i, k} = 0}$.In other words, each random variable corresponds to the event of a single tuple's presence.
%A \emph{\ti} is a \bi where each block contains exactly one tuple.
\Cref{subsec:supp-mat-ti-bi-def} explains \abbrTIDB\xplural and \abbrBIDB\xplural in greater detail.

View File

@ -26,7 +26,7 @@ These include tuple-independent databases~\cite{VS17} (\tis), block-independent
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (\emph{pvc-tables}). As an extension of K-relations, this approach supports bags. % Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}.
% over the symbolic expressions that are used as tuple annotations and values in pvc-tables.
% \cite{FH12} identifies a tractable class of queries involving aggregation.
In contrast, we study a less general data model ($\semNX$-PDBs)
In contrast, we study a less general data model
and query class, but provide a linear time approximation algorithm and provide new insights into the complexity of computing expectations while~\cite{FH12} computes probabilities for individual output annotations.
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07}, relying on Monte Carlo sampling, e.g.,~\cite{DS07}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10}.