Finished incorporating @atri write up changes to S2, S4, S5
parent
d3081e80d1
commit
17b1afb954
15
appendix.tex
|
@ -184,6 +184,21 @@ The property holds for all recursive queries, and the proof holds.
|
|||
|
||||
With \cref{lem:circ-model-runtime} and our upper bound results on \approxq, we now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of an $\raPlus$ query can be computed in essentially the same runtime as deterministic query processing for the same query, proving claim (iv) of the Introduction.
|
||||
|
||||
\section{Proof of \Cref{cor:cost-model}}
|
||||
\begin{proof}
|
||||
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple. To get the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac \delta{n^k}$ for each output tuple and then apply a union bound over all output tuples.} with $\frac \delta{n^k}$).
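To spell out the union bound (a sketch; the event names $F_t$ are introduced here only for illustration): let $F_t$ denote the event that the estimate for output tuple $t$ is not a $(1\pm\eps)$-approximation. With per-tuple failure probability at most $\frac{\delta}{n^k}$ and at most $n^k$ output tuples,
\[
\Pr\Big[\bigcup_{t} F_t\Big] \;\leq\; \sum_{t} \Pr\left[F_t\right] \;\leq\; n^k \cdot \frac{\delta}{n^k} \;=\; \delta .
\]
Assuming the runtime depends on the failure probability only through a $\log\frac{1}{\delta}$ factor, the substitution replaces that factor by $\log\frac{n^k}{\delta} = k\log n + \log\frac{1}{\delta}$, which would be consistent with the $\log(n)$ term in \Cref{cor:cost-model}.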
|
||||
\qed
|
||||
\end{proof}
|
||||
|
||||
\mypar{Higher Moments}
|
||||
%\label{sec:momemts}
|
||||
%
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far we have only focused on the expectation of $\poly$.
|
||||
In addition, we could e.g. prove bounds on the probability of a tuple's multiplicity being at least $1$.
|
||||
Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyshev inequality or other higher-moment-based probability bounds on the events we might be interested in.
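As an illustration for $m = 2$ (a sketch; the symbols $\mu$, $\sigma^2$, and $a$ are introduced only for this example), write $\mu = \expct\pbox{\poly}$ and $\sigma^2 = \expct\pbox{\poly^2} - \mu^2$. Chebyshev's inequality then gives, for any $a > 0$,
\[
\probOf\pbox{\abs{\poly - \mu} \geq a} \;\leq\; \frac{\sigma^2}{a^2},
\]
so the first two moments already bound how far a tuple's multiplicity can deviate from its expectation. If only estimates of the moments are available, the resulting bound is correspondingly approximate.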
|
||||
We leave further investigations for future work.
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
|
@ -3,10 +3,8 @@
|
|||
|
||||
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
|
||||
|
||||
In \Cref{sec:hard}, we showed that the answer to $\Cref{prob:intro-stmt}$ is no.
|
||||
%computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or more general \abbrPDB models) %any $\semNX$-PDB)
|
||||
%is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
|
||||
With this result, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
|
||||
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
|
||||
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
|
||||
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
|
||||
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
|
||||
|
||||
|
@ -42,7 +40,7 @@ For any circuit $\circuit$, the corresponding
|
|||
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
|
||||
\end{Definition}
|
||||
We will overload notation and use $\abs{\circuit}\inparen{\vct{X}}$ to mean $\polyf\inparen{\abs{\circuit}}$.
|
||||
Conveniently, $\abs{\circuit}\inparen{1,\ldots,1}$ gives us the number of terms represented in $\expansion{\circuit}$, i.e. $\sum\limits_{\inparen{\monom, \coef} \in \expansion{\circuit}}\abs{\coef}$.
|
||||
Conveniently, $\abs{\circuit}\inparen{1,\ldots,1}$ gives us $\sum\limits_{\inparen{\monom, \coef} \in \expansion{\circuit}}\abs{\coef}$.
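For instance (a hypothetical circuit, chosen only for illustration), let $\circuit$ encode $(X + 2Y)(2X - Y)$, so that $\polyf(\circuit) = 2X^2 + 3XY - 2Y^2$. Assuming $\expansion{\circuit}$ lists monomials before like terms are combined, it contains $2X^2$, $-XY$, $4XY$, and $-2Y^2$, and $\abs{\circuit}$ encodes $(X + 2Y)(2X + Y)$. Hence
\[
\abs{\circuit}\inparen{1, 1} = (1 + 2)(2 + 1) = 9 = \abs{2} + \abs{-1} + \abs{4} + \abs{-2},
\]
whereas the coefficients of $\polyf(\circuit)$ sum in absolute value to only $7$.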
|
||||
|
||||
\begin{Definition}[\size($\cdot$), \depth$\inparen{\cdot}$]\label{def:size-depth}
|
||||
The functions \size and \depth output the number of gates and levels respectively for input \circuit.
|
||||
|
@ -80,7 +78,7 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
|
|||
\subsection{Our main result}
|
||||
\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
|
||||
\begin{Theorem}\label{lem:approx-alg}
|
||||
Let \circuit be an arbitrary arithmetic circuit from a \abbrBIDB %for a UCQ over \bi
|
||||
Let \circuit be an arbitrary circuit from a \abbrBIDB %for a UCQ over \bi
|
||||
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
|
||||
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
|
||||
{\small
|
||||
|
@ -95,7 +93,7 @@ such that
|
|||
|
||||
To get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
|
||||
$\expansion{\circuit}$
|
||||
to be `canceled' monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
|
||||
that need to be `canceled' when monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
|
||||
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
|
||||
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
|
||||
Given a circuit $\circuit$ from a \abbrBIDB, define
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
%!TEX root=./main.tex
|
||||
|
||||
\section{More on Circuits and Moments}\label{sec:gen}
|
||||
\subsection{More on Circuits and Moments}\label{sec:gen}
|
||||
We formalize our claim from \Cref{sec:intro} that a linear time approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Lastly, we generalize our result for expectation to other moments.
|
||||
|
||||
|
@ -18,7 +18,7 @@ Lastly, we generalize our result for expectation to other moments.
|
|||
\mypar{The cost model}
|
||||
%\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pxdb$'s \dbbaseName $\dbbase$.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
|
||||
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
|
||||
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
|
||||
|
@ -44,7 +44,7 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
|
|||
Under this model a query $Q$ evaluated over database $\dbbase$ has runtime $O(\qruntime{Q,\dbbase})$.
|
||||
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
|
||||
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures their runtime.
|
||||
%
|
||||
%We now make a simple observation on the above cost model:
|
||||
%\begin{proposition}
|
||||
|
@ -56,29 +56,18 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
|
|||
%
|
||||
We are now ready to formally state our claim from \Cref{sec:intro}:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}
|
||||
Given an SPJU query $Q$ over a \ti $\pxdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
|
||||
\begin{Corollary}\label{cor:cost-model}
|
||||
Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
|
||||
%
|
||||
\[
|
||||
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
|
||||
\]
|
||||
\end{Corollary}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{proof}
|
||||
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
|
||||
\end{proof}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\mypar{Higher Moments}
|
||||
%\label{sec:momemts}
|
||||
%
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far we have only focused on the expectation of $\poly$.
|
||||
In addition, we could e.g. prove bounds on the probability of a tuple's multiplicity being at least $1$.
|
||||
Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyshev inequality or other higher-moment-based probability bounds on the events we might be interested in.
|
||||
We leave further investigations for future work.
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
|
@ -11,7 +11,7 @@ in SOP form, the problem is \sharpwonehard for factorized polynomials (proven th
|
|||
We prove that it is possible to approximate the expectation of a lineage polynomial in linear time
|
||||
% When only considering polynomials for result tuples of
|
||||
for UCQs over TIDBs and BIDBs (assuming that there are few cancellations).
|
||||
Interesting directions for future work include development of a dichotomy for bag PDBs and approximations for more general data models. % beyond what we consider in this paper.
|
||||
Interesting directions for future work include development of a dichotomy for bag \abbrPDB\xplural. While we handle higher moments in \Cref{app:sec-cicuits}, more general approximations are an interesting area for exploration, including those for more general data models. % beyond what we consider in this paper.
|
||||
% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
|
||||
|
||||
% \BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:
|
||||
|
|
1
main.tex
|
@ -132,7 +132,6 @@ sensitive=true
|
|||
\input{mult_distinct_p}
|
||||
\input{single_p}
|
||||
\input{approx_alg}
|
||||
\input{circuits-model-runtime}
|
||||
\input{related-work}
|
||||
\input{conclusions}
|
||||
|
||||
|
|
|
@ -9,7 +9,6 @@ and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
|
|||
Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ with individual degree $B <\infty$
|
||||
%\footnote{The standard definition of polynomials requires a finite number of terms.} and $c_\vct{i} \in \domN$
|
||||
is formally defined as: %(with $c_\vct{i} \in \domN$):
|
||||
\AH{Do we want to say that $\domain\inparen{c_\vct{i}} = \domR$ instead? I've only seen the $\semNX$ use case. Is there a legitimate use case for real valued coefficients?}
|
||||
\begin{equation}
|
||||
\label{eq:sop-form}
|
||||
\poly\inparen{X_1,\dots,X_n}=\sum_{\vct{d}\in\{0,\ldots,B\}^n} c_{\vct{d}}\cdot \prod_{i=1}^n X_i^{d_i},
|
||||
|
@ -32,8 +31,7 @@ The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{i=1}^n d_i$ such
|
|||
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
|
||||
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins to produce an output tuple.
|
||||
%in any clause of the $\raPlus$ query that created it.
|
||||
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and tuple $\tup$ such that $\poly\inparen{\vct{X}} = \query(\pdb)(\tup)$.
|
||||
|
||||
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists an $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}$.
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%\begin{Definition}[Modding with a set]\label{def:mod-set}
|
||||
%Let $S$ be a {\em set} of polynomials over $\vct{X}$. Then $\poly(\vct{X})\mod{S}$ is the polynomial obtained by taking the mod of $\poly(\vct{X})$ over {\em all} polynomials in $S$ (order does not matter).
|
||||
|
@ -164,7 +162,7 @@ By \Cref{lem:exp-poly-rpoly} and linearity of expectation, the following corolla
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}\label{cor:expct-sop}
|
||||
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\poly}}$, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
|
||||
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\poly}}$, where $\size\inparen{\poly}$ (\Cref{def:size-depth}) is proportional to the total number of multiplication/addition operators in $\poly$.
|
||||
\end{Corollary}
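As a small illustration (a sketch assuming a \abbrTIDB, where \Cref{def:reduced-bi-poly} is taken to reduce every exponent $e > 1$ to $1$; the variables $X, Y$ and their probabilities $\prob_X, \prob_Y$ are hypothetical): for $\poly = X^2 + 2XY^2 + Y^2$ we have $\rpoly = X + 2XY + Y$, and so
\[
\expct\pbox{\poly} = \rpoly\inparen{\prob_X, \prob_Y} = \prob_X + 2\prob_X\prob_Y + \prob_Y ,
\]
which takes a single pass over the three monomials.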
|
||||
|
||||
|
||||
|
|
|
@ -92,7 +92,7 @@ Note that each circuit \circuit encodes a tree, with edges pointing towards the
|
|||
%\end{figure}
|
||||
We next formally define the relationship of circuits with polynomials. While the definition assumes one sink for notational convenience, it easily generalizes to the multiple sinks case.
|
||||
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
|
||||
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB).\footnote{Recall our assumption that unless otherwise mentioned, all polynomials are considered in $\abbrSMB$.} $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
|
||||
Denote $\polyf(\circuit)$ to be the function from circuit $\circuit$ to its corresponding polynomial (in \abbrSMB). $\polyf(\cdot)$ is recursively defined on $\circuit$ as follows, with addition and multiplication following the standard interpretation for polynomials:
|
||||
\begin{equation*}
|
||||
\polyf(\circuit) = \begin{cases}
|
||||
\polyf(\circuit_\lchild) + \polyf(\circuit_\rchild) &\text{ if \circuit.\type } = \circplus\\
|
||||
|
@ -115,7 +115,7 @@ The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$.
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\medskip
|
||||
|
||||
\noindent We are now ready to formally state our \textbf{main problem}.
|
||||
\noindent We are now ready to formally state the final version of \Cref{prob:intro-stmt}.%our \textbf{main problem}.
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
|
||||
Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables annotating tuples in $\dbbase$. Fix a query $\query$ and an output tuple $\tup$.
|
||||
|
@ -126,6 +126,8 @@ Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables a
|
|||
\textbf{Output}: $\expct_{\vct{W} \sim \pdassign}[\apolyqdt(\vct{W})]$
|
||||
\end{center}
|
||||
\end{Definition}
|
||||
|
||||
\input{circuits-model-runtime}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
||||
|
|
|
@ -12,7 +12,7 @@ A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is
|
|||
|
||||
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
|
||||
%
|
||||
\[\forall \db' \in \query(\idb): \pd'(\db') = \sum_{\db \in \idb: \query(\db) = \db'} \pd(\db). \]
|
||||
%\[\forall \db' \in \query(\idb): \pd'(\db') = \sum_{\db \in \idb: \query(\db) = \db'} \pd(\db). \]
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%NEEDS to be moved to the appendix.
|
||||
|
@ -40,11 +40,11 @@ Recall \Cref{fig:nxDBSemantics} which depicts the semantics for constructing a l
|
|||
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
|
||||
Given a \abbrBPDB $\pdb = (\idb,\pd)$ and lineage polynomial $\apolyqdt$ for an arbitrary output tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
|
||||
we have:
|
||||
$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\randWorld\sim \pdassign}\pbox{\apolyqdt(\randWorld)}. $
|
||||
$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\vct{\randWorld}\sim \pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}. $
|
||||
\end{Proposition}
|
||||
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.\footnote{Although \Cref{prop:expection-of-polynom} follows, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
|
||||
%This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
|
||||
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\randWorld}}$a from now on, assume an implicit result tuple, and so drop the subscript from $\apolyqdt$ (i.e., $\poly$ will denote a polynomial).
|
||||
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \dbbase, \tup$, and so drop the subscript from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
|
||||
|
||||
\subsubsection{\tis and \bis}
|
||||
\label{subsec:tidbs-and-bidbs}
|
||||
|
@ -52,12 +52,12 @@ In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independ
|
|||
%
|
||||
A \bi $\pdb$ is a \abbrPDB with the constraint that
|
||||
%(i) every tuple $\tup_i$ is annotated with a unique random variable $\randWorld_i \in \{0, 1\}$ and (ii) that
|
||||
the tuples can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
|
||||
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
|
||||
}
|
||||
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\pdb$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
|
||||
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB with the added requirement that each block is size $1$.
|
||||
|
||||
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_i \abs{b_i} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
|
||||
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
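For example (block sizes chosen only for illustration), with two blocks of sizes $\abs{b_1} = 2$ and $\abs{b_2} = 3$ we have $\numvar = 5$, and tuple $\tup_{2, 1}$ is represented by $\randWorld_k$ with
\[
k = \abs{b_1} + 1 = 3 .
\]
Any $\vct{\randWorld}$ with $\randWorld_1 = \randWorld_2 = 1$ includes both tuples of block $1$ and therefore has probability $0$ under $\pdassign$.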
|
||||
%that $\forall i \in \abs{\block}, \forall j\neq k \in [\block_i] \suchthat \db\inparen{\tup_{i, j}} = 0 \vee \db\inparen{\tup_{i, k} = 0}$.In other words, each random variable corresponds to the event of a single tuple's presence.
|
||||
%A \emph{\ti} is a \bi where each block contains exactly one tuple.
|
||||
\Cref{subsec:supp-mat-ti-bi-def} explains \abbrTIDB\xplural and \abbrBIDB\xplural in greater detail.
|
||||
|
|
|
@ -26,7 +26,7 @@ These include tuple-independent databases~\cite{VS17} (\tis), block-independent
|
|||
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (\emph{pvc-tables}). As an extension of K-relations, this approach supports bags. % Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}.
|
||||
% over the symbolic expressions that are used as tuple annotations and values in pvc-tables.
|
||||
% \cite{FH12} identifies a tractable class of queries involving aggregation.
|
||||
In contrast, we study a less general data model ($\semNX$-PDBs)
|
||||
In contrast, we study a less general data model
|
||||
and query class, but provide a linear time approximation algorithm and provide new insights into the complexity of computing expectations while~\cite{FH12} computes probabilities for individual output annotations.
|
||||
|
||||
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07}, relying on Monte Carlo sampling, e.g.,~\cite{DS07}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10}.
|
||||
|
|