updates

2021-04-09 15:12:46 -05:00 · 2021-04-09 15:12:46 -05:00 · 3a68fe6822
parent 9dade793f7
commit 3a68fe6822
6 changed files with 39 additions and 34 deletions
--- a/abstract.tex
+++ b/abstract.tex
@ -1,15 +1,14 @@
 %root: main.tex
 %!TEX root=./main.tex
 \begin{abstract}
-  The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds. 
-  The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world. 
-  In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected  multiplicity) exactly and approximately. 
-  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products. 
-  However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization. 
-  As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
-  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$). 
-  We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs to be competitive with deterministic databases.
-
+  The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
+  The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
+  In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected  multiplicity) exactly and approximately.
+  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
+  However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
+  % As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
+  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
+  We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial. This paves the way for PDBs to be competitive with deterministic databases, because expected multiplicities can be computed with only a log factor overhead over many deterministic query processing algorithms.
 \end{abstract}

 %%% Local Variables:
--- a/intro-new.tex
+++ b/intro-new.tex
@ -3,9 +3,9 @@

 \section{Introduction}
 \label{sec:intro}
-A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds. 
-A well-studied problem in probabilistic databases is, given a query $\query$ and probabilistic database $\pdb$, computing the \emph{marginal probability} of a tuple $\tup$, (i.e., its probability of appearing in the result of query $\query$ over $\pdb$). 
-This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs). 
+A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
+A well-studied problem in probabilistic databases is, given a query $\query$ and probabilistic database $\pdb$, computing the \emph{marginal probability} of a tuple $\tup$, (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
+This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
 In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{equation}\label{eq:intro-bag-expectation}
@ -117,38 +117,38 @@ In this work we consider bag semantics, where each tuple is associated with a mu
 \begin{Example}\label{ex:intro-tbls}
  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.

-$$Q_1(\text{City}) :- Loc(\text{City}), Route(\text{City}, \_)$$
+$$Q_1(\text{City}) \dlImp Loc(\text{City}), Route(\text{City}, \dlDontcare)$$

 \Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
-For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$. 
-There are two shipping routes starting in Chicago. 
+For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.
+There are two shipping routes starting in Chicago.
 Since the Chicago location has a 50\% probability of being on schedule (we assume that delays are linked), the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
 \end{Example}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$. 
+A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
 Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).

 For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
-coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$). 
+coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
 Analogously to sets, evaluating the lineage for $t$  over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Example}\label{ex:intro-lineage}
 Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
 In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to  solve the problem in linear time (in the size of  $\linsett{\query}{\pdb}{\tup}$).
-The expectation of the sum is the sum of the expectations of monomials. 
+The expectation of the sum is the sum of the expectations of monomials.
 The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
 The expected multiplicity of Chicago is $1.0$.
 \end{Example}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 The expected multiplicity of a query result can be computed in linear time (in the size of the result's lineage) if the lineage is in SOP form.
-However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits. 
-For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}. 
-The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.  
-Given that there is a large body of work that can  output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ}, 
-an interesting question is whether computing expectations is still in linear time for such compressed representations. 
+However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.
+For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
+The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
+Given that there is a large body of work that can  output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
+an interesting question is whether computing expectations is still in linear time for such compressed representations.
 If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
 Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.

@ -158,12 +158,12 @@ Concretely, we make the following contributions:
 (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
 (iv) We further prove that for \raPlus queries\AR{Some places we use \raPlus and UCQ in others: we should use one consistently (assuming they are both the same)},  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

-%\mypar{Implications of our Results} As mentioned above 
+%\mypar{Implications of our Results} As mentioned above

 \mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing~\Cref{ex:intro-tbls}.

-%Moving forward, we focus exclusively on bags.  
-Consider the query $Q():-$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2():- Q \times Q$.
+%Moving forward, we focus exclusively on bags.
+Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
 %The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
 %\begin{equation*}
 %\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
--- a/intro.tex
+++ b/intro.tex
@ -197,7 +197,7 @@ In this paper we study computing the expected count of an output bag PDB tuple w
 %
 %\begin{Example}\label{ex:bag-vs-set}
 %Continuing the prior example, we are given the following Boolean (resp,. count) query
-%$$\poly() :- R(A), E(A, B), R(B)$$
+%$$\poly() \dlImp R(A), E(A, B), R(B)$$
 %The lineage of the result in a Set-PDB (Bag-PDB) is a Boolean formula (polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
 %Because the query result is a nullary relation, in what follows we can write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
 %
--- a/macros.tex
+++ b/macros.tex
@ -50,6 +50,12 @@
 \newcommand{\qClass}{\mathcal{Q}}
 \newcommand{\raPlus}{\ensuremath{\mathcal{RA}^{+}}\xspace}

+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Datalog
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\newcommand{\dlImp}[0]{\,\ensuremath{\mathtt{{:}-}}\,}
+\newcommand{\dlDontcare}{\_}
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %Approx Alg
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/mult_distinct_p.tex
+++ b/mult_distinct_p.tex
@ -15,7 +15,7 @@ In particular, we will consider the problems of computing the following counts (
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}[\cite{k-match}]
 \label{thm:k-match-hard}
-Given positive integer $k$ and undirected graph $G$ with no self-loops or parallel edges, computing $\numocc{G}{\kmatch}$ exactly is %counting the number of $k$-matchings in $G$ is 
+Given positive integer $k$ and undirected graph $G$ with no self-loops or parallel edges, computing $\numocc{G}{\kmatch}$ exactly is %counting the number of $k$-matchings in $G$ is
 \sharpwonehard (parameterization is in $k$).
 \end{Theorem}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -30,13 +30,13 @@ There exists a constant $\eps_0>0$ such that given an undirected graph $G=(V,E)$
 \end{hypo}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
-Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detection of whether $G$ has a triangle or not takes time $\Omega\inparen{|E|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$. 
+Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detection of whether $G$ has a triangle or not takes time $\Omega\inparen{|E|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
 %The current best known algorithm to count the number of $3$-matchings, to
 %\AR{Need to add something about 3-paths and 3-matchings as well.}

 Both of our hardness results rely on a simple query polynomial encoding of the edges of a graph.
 To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
-Consider the polynomial 
+Consider the polynomial
 $\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.$
 The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -50,7 +50,7 @@ Our hardness results only need a \ti instance; We also consider the special case


 \noindent Returning to in \cref{fig:ex-shipping-simp}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our running example query:
-\[\poly^k_G:- Loc(C_1),Route(C_1, C_1'),Loc(C_1'),\dots,Loc(C_\kElem),Route(C_\kElem,C_\kElem'),Loc(C_\kElem')\]
+\[\poly^k_G\dlImp Loc(C_1),Route(C_1, C_1'),Loc(C_1'),\dots,Loc(C_\kElem),Route(C_\kElem,C_\kElem'),Loc(C_\kElem')\]
 where adapting the PDB instance in \cref{fig:ex-shipping-simp}, relation $Loc$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $\prob$ and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
 Note that this implies that our hard query polynomial can be represented even as an expression tree and is created from a project-join query with same probability value for each $\prob_i$. %; our hardness result transfers here as well.
 % OK: The following (commented-out) sentence feels a bit misplaced here.
@ -63,7 +63,7 @@ Note that this implies that our hard query polynomial can be represented even as
 %Unless otherwise noted, all proofs for this section are in~\Cref{app:single-mult-p}.
 We are now ready to present our main hardness result.
 %
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}\label{thm:mult-p-hard-result}
 Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2k+1)$ distinct values $\prob_i$ ($0\le i \le 2k$) is \sharpwonehard (parameterization is in $k$).
 \end{Theorem}
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -5,7 +5,7 @@

 \iffalse
 \subsection{Superlinearity of Bag PDBs}\label{sec:suplin-bags}
-Moving forward, we focus exclusively on bags.  For $Q():-$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}, consider the product query $\poly^2():- Q \times Q$.
+Moving forward, we focus exclusively on bags.  For $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}, consider the product query $\poly^2()\dlImp Q \times Q$.
 The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
 \begin{equation*}
 \poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
@ -54,7 +54,7 @@ For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is th

 Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
 We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
-In an $\semNX$-databases, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.  
+In an $\semNX$-databases, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
 We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
 Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{|X|}$.
 The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).