Done with pass on S2
parent
bd13ad5569
commit
985c12ecd9
|
@ -1,8 +1,8 @@
|
|||
%!TEX root=./main.tex
|
||||
|
||||
\subsection{Relationship to Deterministic Query Runtimes}\label{sec:gen}
|
||||
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Lastly, we generalize our result for expectation to other moments.
|
||||
%We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
|
||||
%Lastly, we generalize our result for expectation to other moments.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
@ -15,16 +15,16 @@ Lastly, we generalize our result for expectation to other moments.
|
|||
%
|
||||
%\label{sec:circuits}
|
||||
|
||||
\mypar{The cost model}
|
||||
%\mypar{The cost model}
|
||||
%\label{sec:cost-model}
|
||||
So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
|
||||
%So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.
|
||||
%We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.
|
||||
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
|
||||
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
|
||||
%That is, for \bis that fulfill this restriction, approximating the expectation of results of SPJU queries has only a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
|
||||
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
|
||||
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
|
||||
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey} to define $\qruntime{\cdot,\cdot}$.\AR{Recursive definition needs to change based on what Oliver needs. Also I think in the definition below it would be better to replace all $\dbbase$ with $D$.}
|
||||
|
||||
%
|
||||
\noindent\resizebox{1\linewidth}{!}{
|
||||
|
@ -44,7 +44,11 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
|
|||
Under this model a query $Q$ evaluated over database $\dbbase$ has runtime $O(\qruntime{Q,\dbbase})$.
|
||||
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
|
||||
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures their runtime.
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}
|
||||
%\AR{See my comment on element on whether we should include this ref or not.}
|
||||
(and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as $\raPlus$ queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures their runtime.
|
||||
|
||||
More specifically, \Cref{lem:circ-model-runtime} and \Cref{to-be-decided} show that for any $\raPlus$ query $\query$ and $\dbbase$, there exists a circuit $\circuit$ such that $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ and $|\circuit|$ are both $O(\qruntime{Q, \dbbase})$. Recall that we assumed these two bounds when we moved from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}.
|
||||
%
|
||||
%We now make a simple observation on the above cost model:
|
||||
%\begin{proposition}
|
||||
|
@ -54,15 +58,16 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
|
|||
%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:
|
||||
%We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}\label{cor:cost-model}
|
||||
Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
|
||||
%\begin{Corollary}\label{cor:cost-model}
|
||||
% Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
|
||||
%
|
||||
\[
|
||||
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
|
||||
\]
|
||||
\end{Corollary}
|
||||
% \[
|
||||
% O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
|
||||
% \]
|
||||
%\end{Corollary}
|
||||
%Atri: The above is no longer needed
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
|
|
@ -6,16 +6,16 @@
|
|||
We now introduce some terminology
|
||||
and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
|
||||
Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ with individual degree $B <\infty$
|
||||
is formally defined as:
|
||||
is formally defined as (where $c_{\vct{d}}\in \semN$):
|
||||
\begin{equation}
|
||||
\label{eq:sop-form}
|
||||
\poly\inparen{X_1,\dots,X_n}=\sum_{\vct{d}\in\{0,\ldots,B\}^n} c_{\vct{d}}\cdot \prod_{i=1}^n X_i^{d_i},
|
||||
\end{equation}
|
||||
where $c_{\vct{d}}\in \semN$.
|
||||
%where $c_{\vct{d}}\in \semN$.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[Standard Monomial Basis]\label{def:smb}
|
||||
From above, the term $\prod_{i=1}^n X_i^{d_i}$ is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
|
||||
The term $\prod_{i=1}^n X_i^{d_i}$ in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
|
||||
\end{Definition}
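% Illustrative example (not part of the original text): the polynomial below is chosen only for illustration.
As a small illustration, consider the product $(X+Y)(X+Y)$ with $n=2$ and $B=2$. Expanding and collecting like terms gives its \abbrSMB representation:
\[
(X+Y)(X+Y) = X^2 + 2XY + Y^2 .
\]
In the notation of \Cref{eq:sop-form}, the nonzero coefficients are $c_{(2,0)}=1$, $c_{(1,1)}=2$, and $c_{(0,2)}=1$; every other $c_{\vct{d}}$ with $\vct{d}\in\{0,1,2\}^2$ is zero and is therefore dropped.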
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
Unless otherwise noted, we consider all polynomials to be in \abbrSMB representation.
|
||||
|
@ -27,9 +27,9 @@ The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{i=1}^n d_i$ such
|
|||
\end{Definition}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
|
||||
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins to produce an output tuple.
|
||||
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins needed to produce a result tuple.
|
||||
%in any clause of the $\raPlus$ query that created it.
|
||||
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists an $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}.$
|
||||
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\bi-lineage polynomial} (resp., \emph{\ti-lineage polynomial}, or simply lineage polynomial), if there exists an $\raPlus$ query $\query$, \bi (\ti) $\pdb$, and result tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}.$
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
\begin{Definition}[Reduced \bi Polynomials]\label{def:reduced-bi-poly}
|
||||
|
@ -56,14 +56,14 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Lemma}\label{lem:exp-poly-rpoly}
|
||||
Let $\pdb$ be a \abbrBIDB over $\numvar$ input tuples such that the probability distribution $\pdassign$ over $\vct{\randWorld}^\numvar$ (the all worlds set) is induced by the probability vector $\probAllTup = (\prob_1, \ldots, \prob_\numvar)$. As in \Cref{lem:tidb-reduce-poly} for \abbrTIDB, for any \abbrBIDB-lineage polynomial $\poly(\vct{X})$ based on $\pdb$ and query $\query$ we have:
|
||||
Let $\pdb$ be a \abbrBIDB over $\numvar$ input tuples such that the probability distribution $\pdassign$ over $\{0,1\}^\numvar$ (the all worlds set) is induced by the probability vector $\probAllTup = (\prob_1, \ldots, \prob_\numvar)$. As in \Cref{lem:tidb-reduce-poly} for \abbrTIDB, for any \abbrBIDB-lineage polynomial $\poly(\vct{X})$ based on $\pdb$ and query $\query$ we have:
|
||||
% The expectation over possible worlds in $\poly(\vct{X})$ is equal to $\rpoly(\prob_1,\ldots, \prob_\numvar)$.
|
||||
\begin{equation*}
|
||||
\expct_{\vct{W}\sim \pd}\pbox{\poly(\vct{W})} = \rpoly(\probAllTup).
|
||||
\end{equation*}
|
||||
\end{Lemma}
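% Illustrative example (not part of the original text). It assumes that the reduced polynomial $\rpoly$ is obtained from the \abbrSMB form by replacing every exponent $e>1$ with $1$, and that $X$ and $Y$ annotate tuples in different blocks.
To see the lemma at work on a small case, take $\poly(X,Y) = (X+Y)(X+Y) = X^2+2XY+Y^2$, where $X$ and $Y$ annotate tuples from different blocks with probabilities $\prob_X$ and $\prob_Y$. Since $W_X, W_Y\in\{0,1\}$ we have $W_X^2 = W_X$ and $W_Y^2=W_Y$, and since the two tuples are independent, $\expct\pbox{W_XW_Y} = \prob_X\prob_Y$. Linearity of expectation then gives
\[
\expct\pbox{\poly(W_X,W_Y)} = \expct\pbox{W_X^2} + 2\,\expct\pbox{W_XW_Y} + \expct\pbox{W_Y^2} = \prob_X + 2\prob_X\prob_Y + \prob_Y .
\]
Under the assumption above, $\rpoly(X,Y) = X + 2XY + Y$, so the right-hand side is exactly $\rpoly(\prob_X,\prob_Y)$.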
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
By \Cref{lem:exp-poly-rpoly} and linearity of expectation, the following corollary results.
|
||||
By \Cref{lem:exp-poly-rpoly} and linearity of expectation, we get the following corollary.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}\label{cor:expct-sop}
|
||||
|
|
|
@ -87,7 +87,7 @@ The circuit of \Cref{fig:circuit} is an element of $\circuitset{2X^2+3XY-2Y^2}$.
|
|||
\noindent We are now ready to formally state the final version of \Cref{prob:intro-stmt}.%our \textbf{main problem}.
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
|
||||
Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables annotating tuples in $\dbbase$. Fix a query $\query$ and an output tuple $\tup$.
|
||||
Let $\pdb$ be an arbitrary \abbrBIDB-PDB and $\vct{X}$ be the set of variables annotating tuples in $\dbbase$. Fix a query $\query$ and a result tuple $\tup$.
|
||||
The \expectProblem is defined as follows:\\[-7mm]
|
||||
\begin{center}
|
||||
\textbf{Input}: $\circuit \in \circuitset{\polyX}$ for $\polyX = \apolyqdt$
|
||||
|
|
|
@ -15,12 +15,12 @@ For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is th
|
|||
Recall \Cref{fig:nxDBSemantics} which depicts the semantics for constructing a lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
|
||||
|
||||
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
|
||||
Given a \abbrBPDB $\pdb = (\idb,\pd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary output tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
|
||||
Given a \abbrBPDB $\pdb = (\idb,\pd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary result tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,
|
||||
we have (denoting $\randDB$ as the random variable over $\idb$):
|
||||
$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\vct{\randWorld}\sim \pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}. $
|
||||
\end{Proposition}
|
||||
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.\footnote{Although \Cref{prop:expection-of-polynom} follows, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
|
||||
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \dbbase, \tup$, and so drop the subscript from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
|
||||
We focus on the problem of computing $\expct_\pdassign\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ from now on, assume implicit $\query, \dbbase, \tup$, and drop them from $\apolyqdt$ (i.e., $\poly\inparen{\vct{X}}$ will denote a polynomial).
|
||||
|
||||
\subsubsection{\tis and \bis}
|
||||
\label{subsec:tidbs-and-bidbs}
|
||||
|
@ -28,13 +28,13 @@ In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independ
|
|||
%
|
||||
A \bi $\pdb$ is a \abbrPDB with the constraint that
|
||||
%(i) every tuple $\tup_i$ is annotated with a unique random variable $\randWorld_i \in \{0, 1\}$ and (ii) that
|
||||
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k, j \in [\abs{i}], j' \in [\abs{k}])$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
|
||||
the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k)$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
|
||||
}
|
||||
Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$. The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\ell,\abs{\block_{\ell}}}}$ and the conditions on the blocks. A \abbrTIDB is a \abbrBIDB where each block has size exactly $1$.
|
||||
|
||||
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
|
||||
\Cref{subsec:supp-mat-ti-bi-def} explains \abbrTIDB\xplural and \abbrBIDB\xplural in greater detail.
|
||||
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\vct{\randWorld}\in \{0, 1\}^\numvar$,\footnote{Here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\randWorld_k \in \vct{\randWorld}$ represents the presence of $\tup_{i, j}$ (where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$). We denote a probability distribution over all $\vct{\randWorld} \in \{0, 1\}^\numvar$ as $\pdassign$. When $\pdassign$ is the one induced from each $\prob_{i, j}$ while assigning $\probOf\pbox{\vct{\randWorld}} = 0$ for any $\vct{\randWorld}$ with $\randWorld_{i, j} = \randWorld_{i, k} = 1$ for any block $i$ and $j\neq k$, we end up with a bijective mapping from $\pd$ to $\pdassign$, such that each mapping is equivalent, implying the distributions are equivalent.
|
||||
\Cref{subsec:supp-mat-ti-bi-def} has more details. % explains \abbrTIDB\xplural and \abbrBIDB\xplural in greater detail.
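% Illustrative example (not part of the original text): the block sizes below are hypothetical.
As a concrete instance of the indexing above, suppose there are two blocks with (hypothetical) sizes $\abs{b_1}=2$ and $\abs{b_2}=3$, so that $\numvar = 5$. The tuple $\tup_{2,1}$ is then represented by $\randWorld_k$ with
\[
k = \abs{b_1} + 1 = 3 ,
\]
and $\tup_{1,1},\tup_{1,2},\tup_{2,2},\tup_{2,3}$ map to $k=1,2,4,5$, respectively. Of the $2^5 = 32$ vectors in $\{0,1\}^5$, only $(2+1)\cdot(3+1) = 12$ contain at most one present tuple per block; $\pdassign$ assigns probability $0$ to the remaining $20$, e.g., to every vector with $\randWorld_1=\randWorld_2=1$.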
|
||||
|
||||
|
||||
%%% Local Variables:
|
||||
|
|