Trims and flow tweaks

master
Oliver Kennedy 2021-09-12 23:44:44 -04:00
parent b697e7763f
commit 441eb67719
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
5 changed files with 105 additions and 50 deletions

@ -4,16 +4,13 @@
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs).
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
Such factorized representations are
exploited by modern join algorithms (e.g., worst-case optimal joins), and
so our results imply that computing probabilities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead.
We are specifically interested in the fine-grained complexity of this problem relative to the complexity of deterministic query evaluation --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
Unfortunately, we show the reverse; our results imply that computing probabilities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead.
% Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by project-join queries if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
The problem stays hard even if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
We proceed to study polynomials of result tuples of positive relational algebra queries ($\raPlus$) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial (arithmetic) circuits in linear time in the size of the circuit.
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expected multiplicity of an output tuple in linear time in the runtime of a comparable deterministic query.
% By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
\end{abstract}
%%% Local Variables:

@ -2,17 +2,20 @@
%root: main.tex
\section{Introduction}\label{sec:intro}
A probabilistic database (PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$, where $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. In this work, we are interested in bag semantics where each tuple $\tup$ is associated with a multiplicity $\db(\tup)$ from $\semN$ in each possible world\footnote{We find it convenient to use the notation from~\cite{DBLP:conf/pods/GreenKT07} which models bag relations as functions that map tuples to their multiplicity.}.
A commonly studied problem in probabilistic databases is, given a query $\query$, PDB $\pdb$, and possible query result tuple $\tup$, to compute the tuple's \textit{marginal probability} of being in the query's result, i.e., computing the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise.
In this work, we are interested in bag semantics, where each tuple is associated with a multiplicity.
Following~\cite{DBLP:conf/pods/GreenKT07}, we model bag databases (resp., relations) as functions from each $\tup$ to the tuple's multiplicity $\db(\tup)$ from $\semN$ in a possible world $\db$.
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
The natural generalization of the problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that assigns value $\query(\db)(\tup)$ in world $\db$:
\AH{I think I understand what is being stated in this last sentence, but I wonder if phrasing the end something like, ``for world $\db \in \idb$ would be easier to digest for the average reviewer...maybe it was just me.}
The natural generalization of the problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that assigns value $\query(\db)(\tup) \in \semN$ in world $\db \in \idb$:
%OK: done
%\AH{I think I understand what is being stated in this last sentence, but I wonder if phrasing the end something like, ``for world $\db \in \idb$ would be easier to digest for the average reviewer...maybe it was just me.}
% In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
%In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity]\label{prob:bag-pdb-query-eval}
Given a positive relational algebra query\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} ($\raPlus$) $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected
Given an \raPlus query\footnote{The class of positive relational algebra (\raPlus) queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected
multiplicity ($\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$)
of tuple $\tup$.
\end{Problem}
@ -26,7 +29,7 @@ A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski198
%Each valuation of the random variables appearing in this formula corresponds to one possible world.
%Given a joint probability distribution over such assignments, the marginal probability of a query result tuple $\tup$ is the probability that the lineage formula of $\tup$ evaluates to true. Given a \abbrBPDB $\pdb$, we refer to the above encoding of $\pdb$ as \dbbaseName and denote it as $\dbbase$.
%
The bag semantics analog of a lineage formula is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}-- see~\Cref{fig:nxDBSemantics} for a definition-- a polynomial with integer coefficients and exponents over integer variables $\vct{X}$ encoding the multiplicity of input tuples.
The bag semantics analog is a provenance/lineage polynomial $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07} (see~\Cref{fig:nxDBSemantics} for a definition), a polynomial with integer coefficients and exponents over integer variables $\vct{X}$ encoding input tuple multiplicities.
\begin{figure}
\begin{align*}
\polyqdt{\project_A(\query)}{\dbbase}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\dbbase}{\tup'} &
@ -74,7 +77,7 @@ We drop $\query$, $\dbbase$, and $\tup$ from $\apolyqdt$ when they are clear fro
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected
multiplicity of $\apolyqdt$ ($\expct_{\vct{W}\sim \pdassign}\pd\pbox{\apolyqdt(\vct{W})}$),
multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pd\pbox{\apolyqdt(\vct{W})}$),
where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments to variables of $\apolyqdt$.
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -101,7 +104,9 @@ We initially focus on tuple-independent probabilistic bag-databases\footnote{See
}
% OK: I tidied things up a touch.
%\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way: i.e. $\tup_i\in\db$ in it iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$.
We will denote $\dbbase=(t_1,\dots,t_\numvar)$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world $\db\in\idb$ in the natural way, i.e., $\tup_i\in\db$
iff $W_i=1$. Furthermore, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where for each $i\in [n]$, $\probOf(W_i=1)=p_i$.
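As a small illustration (a hypothetical instance with $\numvar=2$, not taken from the running example), the vector $\vct{W}=(1,0)$ encodes the world that contains $\tup_1$ but not $\tup_2$, and independence of the two tuples gives
\begin{align*}
\probOf\pbox{\vct{W}=(1,0)} = p_1\cdot\inparen{1-p_2}.
\end{align*}
With the assumed values $p_1=0.5$ and $p_2=0.2$ this is $0.5\cdot 0.8=0.4$, and the probabilities of the four worlds ($0.4$, $0.1$, $0.4$, and $0.1$) sum to $1$.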
%Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$
%\AH{Where do we use this notation? If we use this somewhere, should we maybe use $\db_{\vct{\randWorld}}$ instead?}
% as the world represented by $\vct{W}$.
@ -129,13 +134,16 @@ However, in practice (and in theory), we care about the {\em fine-grained} compl
For \abbrBPDB $\pdb$ and query $Q$, let $\timeOf{}^*(Q,\pdb)$ denote the (optimal) runtime complexity of \Cref{prob:bag-pdb-query-eval} (over all result tuples $\tup$).\AR{Am changing these runtime definitions to include the runtime for all result tuples $\tup$.}
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:circuit-runtime}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
Denote by $\qruntime{Q, \db}$ the `runtime' of query $Q$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms (we make this runtime concrete in \Cref{sec:circuit-runtime}\AR{We need to move the definition of $\qruntime{}$ to \Cref{sec:background} because among others we now need it in our lower bound arguments as well.}).
%Denoting by $\dbbase = \bigcup_{\db \in \idb} \db$ the set of all possible tuples in \abbrPDB $\pdb = \inparen{\idb, \pd}$,
We finally have all the pieces to state a formal specification of our problem:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}\label{prob:informal}
Given an $\raPlus$ query $\query$ and \abbrTIDB\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.} \abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$?
Given an $\raPlus$ query $\query$ and \abbrTIDB
% OK: added motivation
%\AR{Changed this to \abbrTIDB: we should motivate why we are restricting ourselves to this special case here.}
\abbrBPDB $\pdb$, is it the case that $\timeOf{}^*(Q,\pdb) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% However the question remains: \emph{can bag-probabilistic databases be as fast as deterministic queries}.
@ -145,12 +153,13 @@ Given an $\raPlus$ query $\query$ and \abbrTIDB\AR{Changed this to \abbrTIDB: we
%The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether the query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing time. We stress that this question is very well motivated. In particular,
we note that an answer in the affirmative for~\Cref{prob:informal} indicates that bag-probabilistic databases can be competitive with classical deterministic databases, opening the door for deployment in practice.
We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether query evaluation over \abbrBPDB is {\em linear} in the runtime of deterministic query processing.
We stress that this question is very well motivated, even for one of the simplest models of probabilistic databases (i.e., \abbrTIDBs).
In particular, we note that an answer in the affirmative for~\Cref{prob:informal} indicates that bag-probabilistic databases can be competitive with deterministic databases, opening the door for practical deployment.
\mypar{Our lower bound results} Unfortunately, we prove the negative. In fact in Table~\ref{tab:lbs}\AR{Cref was not formatting Table correct so added Table in explicitly.} we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to \Cref{prob:informal}.
\begin{table}
\begin{tabular}{|p{0.4\textwidth}|p{0.15\textwidth}|p{0.45\textwidth}|}
\begin{tabular}{|p{0.4\textwidth}|p{0.15\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(Q,\pdb)$ & Num. $\pd$s & Hardness Assumption\\
\hline
@ -158,14 +167,16 @@ $\Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{1+\eps_0}}$ for {\em some} $\ep
\hline
$\omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
\hline
$\Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Current algorithms for counting $k$-matchings\\
$\Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same $\dbbase$ and those with `Multiple' in the second column need the algorithm to be able to handle multiple $\pd$. The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are all constants independent of $k$).}
\label{tab:lbs}
\end{table}
Note that the lower bound in the first row by itself is enough to refute \Cref{prob:informal}.
To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k})$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuple $\tup$ (which is the parameter that defines our family of hard queries). What our lower bound in the third rows says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{We note that similar hardness results are known for determinsitic query processing is one is looking for lower bounds in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\timeOf{}^*(Q,\pdb)$, which in general can be super-linear in $\abs{\dbbase}$.} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing resluts already imply the claimed lower bounds if we were to replace the $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$-- indeed these results follow from known lower bound for deterministic query processing-- our contribution to then identify a family of hard query where deterministic query procedding is `easy' but computing the expected multuplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (which is the parameter that defines our family of hard queries). Our lower bound in the third row says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{
We note that similar hardness results are known for deterministic query processing, with lower bounds stated in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\qruntime{Q,\dbbase}$, which in general can be super-linear in $\abs{\dbbase}$.
} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$; indeed, these results follow from known lower bounds for deterministic query processing. Our contribution is then to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
% Atri: Converting sub-section to para since it saves space
@ -177,11 +188,15 @@ As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a prop
%Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though
%\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).}
% whose evaluation follows the standard Boolean semi-ring semantics (i.e. addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$.
and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set-sematics is also no. Dalvi and Suicu \cite{10.1145/1265530.1265571} showed that the complexity of the query evaluation problem over set-\abbrPDB\xplural is \sharpphard \AR{This is result for TIDBs for general set-PDBs?}
and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. We note that the answer to \Cref{prob:informal} for set semantics is also no. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query evaluation problem over set-\abbrTIDB\xplural is \sharpphard
%OK: The former ---v
%\AR{This is result for TIDBs for general set-PDBs?}
%Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble
%\footnote{\sharpp is the counting version for problems residing in the NP complexity class.}
in general, and proved that a dichotomy exists for this problem for the class of union of conjunctive queries (with the same expressive power as $\raPlus$), where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard in data complexity. %for any polynomial-time deterministic query.
Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$). We note that the \sharpphard lower bound is much stronger than what one can hope for in \abbrBPDB since, as mentioned earlier, for a fixed query one can always solve \Cref{prob:bag-pdb-query-eval} in polynomial time.
in general.
%, and proved that a dichotomy exists for this problem for the class of union of conjunctive queries (with the same expressive power as $\raPlus$), where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard in data complexity. %for any polynomial-time deterministic query.
%Thus, for the hard queries, the answer to~\Cref{prob:informal} is {\em no} for set-PDBs (under the standard complexity assumption that $\sharpp\ne \polytime$).
We note that the \sharpphard lower bound is much stronger than what one can hope for in \abbrBPDB since, as mentioned earlier, for a fixed query one can always solve \Cref{prob:bag-pdb-query-eval} in polynomial time.
%Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
%This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events.
@ -213,33 +228,39 @@ For queries on the hard side of the dichotomy, and the best known algorithmic ap
\mypar{Approximating the expected multiplicities}
\AR{Have done my pass till here}
Our negative results indicate that \abbrBPDB{}s can not achieve comparable performance to deterministic databases for exact results (under standard complexity results). In fact, under plausible hardness conjecture, one cannot improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrTIDB. A natural followup questions is whether we can do better if we are willing to settle for an approximation to the expeccted multiplities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is indeed achievable.
Our negative results indicate that \abbrBPDB{}s cannot achieve comparable performance to deterministic databases for exact results (under standard complexity assumptions). In fact, under plausible hardness conjectures, one cannot improve upon the trivial algorithm to exactly compute the expected multiplicities for \abbrTIDB. A natural follow-up question is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
In the remainder of this work, we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
Like set-probabilistic databases, our approach adopts the intensional model of query evaluation, as illustrated in \Cref{fig:two-step}.
Given input $\dbbase$ and $\query$, the first step, which we will refer to as \termStepOne (\abbrStepOne), outputs every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$).
The second step, \termStepTwo (\abbrStepTwo) consists of computing $\expct\pbox{\poly(\vct{\randWorld})}$ from the output of the first step.
For \abbrBPDB $\pdb$, query $\query$, let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne, when it outputs $\circuit$ (which is a representation of $\poly$-- more on this representation shortly).
For \abbrBPDB $\pdb$, query $\query$, let $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ denote the runtime of \abbrStepOne, when it outputs $\circuit$ (which is a representation of $\poly$ --- more on this representation shortly).
Let us denote by $\timeOf{\abbrStepTwo}(\circuit)$ (recall $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo, allowing us to formally define our objective:
\input{two-step-model}
\begin{Problem}\label{prob:big-o-joint-steps}
Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$ and output tuple $\tup$,
does there exist a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ (for all resuult tuples $\tup$) for some $\circuit$ such that
$\timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$?
Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected muktiplities).
Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplicities).
We show in \Cref{sec:gen}
%\OK{confirm this ref}
%Atri: fixed the ref
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial of result tuples of an $\raPlus$ query $\query$ (or more preicsely its representation $\circuit$).
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single $\circuit$ with one sink per tuple representing the lineage).
% , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}.
A key insight of this paper is that the representation of $\circuit$ matters. For example if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{This is the representation where the polynomial is reresented as sum of `pure' products-- see \Cref{def:smb} for a formal definition.}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, and hence, just $\timeOf{\abbrStepOne}(Q,\dbbase,\circuit)$ will be too large.
However, one can have compact representations of $\poly(\vct{X})$ (e.g., resulting from optimizations like projection push-down~\cite{DBLP:books/daglib/0020812}, which produce factorized representations of $\poly(\vct{X})$.
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$. To capture such factorizations, this work uses (arithmetic) circuits\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes representing either an addition or multiplication operator.}
However, systems can generate compact representations of $\poly(\vct{X})$ (e.g., through optimizations like projection push-down~\cite{DBLP:books/daglib/0020812}, which directly result in factorized representations of $\poly(\vct{X})$).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes, with internal nodes representing either an addition or multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
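To sketch why the factorized form loses nothing when taking expectations (assuming, as in a \abbrTIDB, that $B$, $Y$, and $Z$ are independent $\{0,1\}$-valued variables), linearity of expectation and independence give
\begin{align*}
\expct\pbox{B(Y+Z)} = \expct\pbox{B}\cdot\inparen{\expct\pbox{Y}+\expct\pbox{Z}} = \expct\pbox{BY}+\expct\pbox{BZ}.
\end{align*}
In this small example the factorized form needs one multiplication where the SMB form needs two. The shortcut relies on the two factors sharing no variables; it does not apply to products such as $B\cdot B$, where $\expct\pbox{B\cdot B}=\expct\pbox{B}$ rather than $\expct\pbox{B}^2$.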
@ -291,10 +312,10 @@ Given a circuit $\circuit$ for $\apolyqdt$ (over all result tuples $\tup$) for \
(i) We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes}, where there is a single result tuple.\footnote{We can approximate the expected output tuple multiplicities (for all output tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding.\AR{cite?}
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}.
%Atri: The footnote below does not add much
%\footnote{Note that this doesn't rule out queries for which approximation is linear});
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); (iv) We further prove that for \raPlus queries
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural);
%\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
(iii) We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
@ -310,7 +331,7 @@ It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole output
The lineage polynomial for $Q^2$ is given by $\poly^2\inparen{A, B, C, E, X, Y, Z}$:\AR{Changed the variable $D$ to $E$ to avoid conflict with use of $D$ as a DB.}
\begin{multline*}
\inparen{AXB + BYE + BZC}^2\\
%\inparen{AXB + BYE + BZC}^2\\
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
\end{multline*}
By exploiting linearity of expectation of summand terms, and further pushing expectation through independent \abbrTIDB variables, the expectation
@ -341,9 +362,9 @@ With $\poly^2\inparen{A, B, C, E, X, Y, Z}$ as an example, we have:
&\widetilde{\poly^2}(A, B, C, E, X, Y, Z) = AXB + BYE + BZC + 2AXBYE + 2AXBZC + 2BYEZC.
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{align*}
Note that we have argued that for our specific example the expectation that we want to compute is $\widetilde{\poly^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{E=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$.
Note that we have argued that for our specific example the expectation that we want is $\widetilde{\poly^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{E=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over \abbrTIDB (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
Let $\pdb$ be a \abbrTIDB over $n$ input tuples
%\OK{Should this be $\vct{W}$?} $\vct{X} = \{X_1,\ldots,X_\numvar\}$
@ -356,19 +377,22 @@ For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}=\apolyqdt{\vct{X}}$
$
\end{Lemma}
To prove our hardness result we show that for the same $Q$ considered in the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \Cref{fig:two-step}}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in an arbitrary graph $G$ (which is used to define the $Route$ relation in $\query$) isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, which is a known hard problem in parameterized/fine-grained complexity literature.
To prove our hardness result we show that for the same $Q$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \Cref{fig:two-step}}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in an arbitrary graph $G$ (which is used to define the $Route$ relation in $\query$) isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\poly}\left(\prob_1,\dots, \prob_n\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other six variables), we can see that
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\poly}\left(\prob_1,\dots, \prob_n\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C \\
&= \rpoly\inparen{\vct{p}}.
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
If we assume that all of the seven probability values are at least $p_0>0$,
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
then we note that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
%
To get a $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
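
The two claims made for the running example (that the reduced polynomial evaluated at the marginal probabilities equals the expected multiplicity, and that $\poly^2\inparen{\vct{\prob}}$ lies within a factor $\inparen{p_0}^3$ of it) can be sanity-checked by brute force. The following is only an illustrative sketch, not part of the paper's algorithm: the probability values in `probs` and all function names are hypothetical, and the expectation is computed by enumerating all $2^7$ possible worlds of a tuple-independent database.

```python
from itertools import product

# The seven Boolean variables of the running example.
VARS = ["A", "B", "C", "E", "X", "Y", "Z"]


def poly(w):
    """Lineage polynomial AXB + BYE + BZC evaluated on an assignment w."""
    return (w["A"] * w["X"] * w["B"]
            + w["B"] * w["Y"] * w["E"]
            + w["B"] * w["Z"] * w["C"])


def poly_sq(w):
    """Lineage polynomial of Q^2, i.e. (AXB + BYE + BZC)^2."""
    return poly(w) ** 2


def reduced_poly_sq(p):
    """Reduced polynomial of Q^2: every exponent > 1 is dropped to 1."""
    return (p["A"] * p["X"] * p["B"]
            + p["B"] * p["Y"] * p["E"]
            + p["B"] * p["Z"] * p["C"]
            + 2 * p["A"] * p["X"] * p["B"] * p["Y"] * p["E"]
            + 2 * p["A"] * p["X"] * p["B"] * p["Z"] * p["C"]
            + 2 * p["B"] * p["Y"] * p["E"] * p["Z"] * p["C"])


def expected_poly_sq(p):
    """Exact expectation of poly_sq, enumerating all 2^7 possible worlds.

    Assumes tuple independence: a world's probability is the product of
    p[v] for present tuples and (1 - p[v]) for absent ones.
    """
    total = 0.0
    for bits in product([0, 1], repeat=len(VARS)):
        world = dict(zip(VARS, bits))
        weight = 1.0
        for v in VARS:
            weight *= p[v] if world[v] else 1.0 - p[v]
        total += weight * poly_sq(world)
    return total


# Hypothetical marginal probabilities, chosen only for illustration.
probs = {"A": 0.9, "B": 0.5, "C": 0.5, "E": 0.7, "X": 0.8, "Y": 0.6, "Z": 0.4}

expectation = expected_poly_sq(probs)
reduced = reduced_poly_sq(probs)
naive = poly_sq(probs)      # original polynomial evaluated at the probabilities
p0 = min(probs.values())    # lower bound on all seven probabilities

print(abs(expectation - reduced) < 1e-9)   # reduced polynomial = expectation
print(p0 ** 3 * reduced <= naive <= reduced)  # constant-factor sandwich
```

Enumeration is exponential in the number of variables, so this check is only feasible for toy instances such as this one; it is meant to make the identity and the bound concrete, not to suggest a way of computing the expectation.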


@ -606,3 +606,37 @@ Maximilian Schleich},
}
@article{DBLP:journals/jal/KarpLM89,
author = {Richard M. Karp and
Michael Luby and
Neal Madras},
title = {Monte-Carlo Approximation Algorithms for Enumeration Problems},
journal = {J. Algorithms},
volume = {10},
number = {3},
pages = {429--448},
year = {1989}
}
@inproceedings{ajar,
author = {Manas R. Joglekar and
Rohan Puttagunta and
Christopher R{\'{e}}},
editor = {Tova Milo and
Wang{-}Chiew Tan},
title = {{AJAR:} Aggregations and Joins over Annotated Relations},
booktitle = {Proceedings of the 35th {ACM} {SIGMOD-SIGACT-SIGAI} Symposium on Principles
of Database Systems, {PODS} 2016, San Francisco, CA, USA, June 26
- July 01, 2016},
pages = {91--106},
publisher = {{ACM}},
year = {2016},
url = {https://doi.org/10.1145/2902251.2902293},
doi = {10.1145/2902251.2902293},
timestamp = {Tue, 06 Nov 2018 16:58:02 +0100},
biburl = {https://dblp.org/rec/conf/pods/JoglekarPR16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


@ -5,7 +5,7 @@
\subsection{Probabilistic Databases}
Following typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with $\{0, 1\}$ input.
Following typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ and a unique tuple-id field to allow duplicate tuples.
An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.
@ -35,7 +35,7 @@ For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is th
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%END: move to appendix.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Recall \cref{fig:nxDBSemantics} which depicts the semantics for constructing a lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
Recall \Cref{fig:nxDBSemantics} which depicts the semantics for constructing a lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\idb,\pd)$ and lineage polynomial $\apolyqdt$ for arbitrary output tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,


@ -9,7 +9,7 @@
\begin{figure}[h!]
\centering
\resizebox{\textwidth}{5.5cm}{%
\resizebox{\textwidth}{5.2cm}{%
\begin{tikzpicture}
%pdb cylinder
\node[cylinder, text width=0.28\textwidth, align=center, draw=black, text=black, cylinder uses custom fill, cylinder body fill=blue!10, aspect=0.12, minimum height=5cm, minimum width=2.5cm, cylinder end fill=blue!50, shape border rotate=90] (cylinder) at (0, 0) {