Minor adjustments
parent
58b70f0fcf
commit
6d2a684189
|
@ -160,7 +160,7 @@ Then, in expectation we have
|
|||
&= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
|
||||
&= \rpoly(\prob_1,\ldots, \prob_\numvar)\label{p1-s5}
|
||||
\end{align}
|
||||
In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent), the expecation can be pushed all the way inside of the product. In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$. Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
|
||||
In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent, or the monomial expectation is 0), the expecation can be pushed all the way inside of the product. In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$. Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
|
||||
|
||||
Finally, observe \Cref{p1-s5} by construction in \Cref{lem:pre-poly-rpoly}, that $\rpoly(\prob_1,\ldots, \prob_\numvar)$ is exactly the product of probabilities of each variable in each monomial across the entire sum.
|
||||
\qed
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
|
||||
\subsection{$\semK$-relations and $\semNX$-PDBs}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
|
||||
\input{app_notation-background}
|
||||
|
||||
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
|
||||
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
|
||||
|
||||
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
|
||||
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or any $\semNX$-PDB) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
|
||||
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after~\Cref{lem:val-ub} for more.}
|
||||
The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}.
|
||||
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
|
||||
|
@ -106,7 +106,7 @@ Further, under either of the following conditions:
|
|||
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
|
||||
\end{Lemma}
|
||||
|
||||
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
|
||||
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under \raPlus or FAQ.
|
||||
|
||||
\subsection{Approximating $\rpoly$}
|
||||
The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
%!TEX root=./main.tex
|
||||
|
||||
\section{More on Circuits and Moments}\label{sec:gen}
|
||||
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries under reasonable assumptions.
|
||||
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Lastly, we generalize our result for expectation to other moments.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -18,7 +18,7 @@ Lastly, we generalize our result for expectation to other moments.
|
|||
\mypar{The cost model}
|
||||
%\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
|
||||
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
|
||||
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
|
||||
|
@ -75,9 +75,11 @@ This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime})
|
|||
%\label{sec:momemts}
|
||||
%
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far we have only focused on the expectation of $\poly$. In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$. Progress can be made on this as follows:
|
||||
So far we have only focused on the expectation of $\poly$.
|
||||
In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$.
|
||||
Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
|
||||
We leave a further investigation of this question for future work.
|
||||
We leave further investigations for future work.
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
\section{Introduction}
|
||||
\label{sec:intro}
|
||||
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
|
||||
A well-studied problem in probabilistic databases is, given a query $\query$ and probabilistic database $\pdb$, computing the \emph{marginal probability} of a tuple $\tup$, (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
|
||||
A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
|
||||
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
|
||||
In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -115,7 +115,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Example}\label{ex:intro-tbls}
|
||||
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.
|
||||
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.
|
||||
|
||||
$$Q_1(\text{City}) \dlImp Loc(\text{City}), Route(\text{City}, \dlDontcare)$$
|
||||
|
||||
|
@ -126,7 +126,10 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
|
|||
\end{Example}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
|
||||
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
|
||||
over random variables
|
||||
($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})
|
||||
that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
|
||||
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).
|
||||
|
||||
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
|
||||
|
@ -135,11 +138,12 @@ Analogously to sets, evaluating the lineage for $t$ over an assignment correspo
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Example}\label{ex:intro-lineage}
|
||||
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
|
||||
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\linsett{\query}{\pdb}{\tup}$).
|
||||
The expectation of the sum is the sum of the expectations of monomials.
|
||||
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected tuple multiplicity is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
|
||||
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation
|
||||
(i.e., the expectation of the sum is the sum of the expectations of monomials)
|
||||
to solve the problem in linear time (in the size of $\linsett{\query}{\pdb}{\tup}$).
|
||||
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
|
||||
The expected multiplicity of Chicago is $1.0$.
|
||||
The expected multiplicity for Chicago is $1.0$.
|
||||
\end{Example}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
@ -147,29 +151,29 @@ The expected multiplicity of a query result can be computed in linear time (in t
|
|||
However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.
|
||||
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
|
||||
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
|
||||
Given that there is a large body of work that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
|
||||
Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
|
||||
an interesting question is whether computing expectations is still in linear time for such compressed representations.
|
||||
If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
|
||||
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.
|
||||
|
||||
Concretely, we make the following contributions:
|
||||
(i) We show that the expected result multiplicity problem for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
|
||||
(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or its FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
|
||||
(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
|
||||
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
|
||||
(iv) We further prove that for \raPlus queries\AR{Some places we use \raPlus and UCQ in others: we should use one consistently (assuming they are both the same)}, we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
|
||||
(iv) We further prove that for \raPlus queries (a equivalently expressive, but factorizable form of UCQs), we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
|
||||
|
||||
%\mypar{Implications of our Results} As mentioned above
|
||||
|
||||
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing~\Cref{ex:intro-tbls}.
|
||||
|
||||
%Moving forward, we focus exclusively on bags.
|
||||
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
|
||||
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
|
||||
%The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
|
||||
%\begin{equation*}
|
||||
%\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
|
||||
%\end{equation*}
|
||||
%This equivalent SOP representation is
|
||||
Note that the lineage polynomial for $Q^2$ is given by $\Phi^2$:
|
||||
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
|
||||
\begin{equation*}
|
||||
\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
|
||||
\end{equation*}
|
||||
|
@ -178,7 +182,7 @@ The expectation $\expct\pbox{\Phi^2}$ then is:
|
|||
\expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
|
||||
+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
|
||||
\end{multline*}
|
||||
\noindent Note that if the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
|
||||
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
|
||||
\begin{footnotesize}
|
||||
\begin{equation*}
|
||||
\expct\pbox{L_a^2}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
|
||||
|
@ -198,7 +202,7 @@ It can be verified that the reduced polynomial is a closed form of the expected
|
|||
|
||||
%The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial. Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation. In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .
|
||||
|
||||
To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode variaous hard graph counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
|
||||
To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode variaous hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
%!TEX root=./main.tex
|
||||
We use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
|
||||
Let $\udom$ be a countable domain of values.
|
||||
Formally, an n-ary $\semK$-relation over a domain of attribute values $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
|
||||
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
|
||||
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
|
||||
|
||||
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07} illustrated in \cref{fig:nxDBSemantics}.
|
||||
|
|
|
@ -95,8 +95,8 @@
|
|||
\newcommand{\rchild}{\vari{R}}
|
||||
%members of T
|
||||
\newcommand{\val}{\vari{val}}
|
||||
\newcommand{\wght}{\vari{weight}\xspace}
|
||||
\newcommand{\vpartial}{\vari{partial}\xspace}
|
||||
\newcommand{\wght}{\vari{weight}}
|
||||
\newcommand{\vpartial}{\vari{partial}}
|
||||
%types of T
|
||||
\newcommand{\var}{\textsc{var}\xspace}
|
||||
\newcommand{\tnum}{\textsc{num}\xspace}
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
\section{Hardness of exact computation}
|
||||
\label{sec:hard}
|
||||
|
||||
In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query (even in an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold.
|
||||
In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs under bag semantics. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold.
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
|
|
@ -33,7 +33,7 @@ The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{j=1}^n i_j$ such
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
The degree of the polynomial $X^2+2XY+Y^2$ is $2$.
|
||||
Product terms in lineage arise only as a consequence of join operations, so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins in any clause of the UCQ query that created it.
|
||||
Product terms in lineage arise only from join operations (\cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins in any clause of the UCQ query that created it.
|
||||
In this paper we consider only finite degree polynomials.
|
||||
%
|
||||
% Throughout this paper, we also make the following \textit{assumption}.
|
||||
|
@ -48,13 +48,7 @@ We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (resp., \
|
|||
%\AH{Why is it required for the tuple to be n-ary? I think this slightly confuses me since we have n tuples.}
|
||||
% OK: agreed w/ AH, this can be treated as implicit
|
||||
there exists a $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$, or $\semNX$-PDB $\pxdb$), and tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are \bis (which subsume \tis as a special case).
|
||||
As a special case of \bis, the following applies to \tis as well.
|
||||
In a \bi $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \pd[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_j \in b_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
|
||||
}
|
||||
Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
|
||||
We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
|
||||
$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in~\Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.}
|
||||
|
||||
%\SF{Where is $\block_{i, j}$ used? Is it $X_{\block_{1, 1}}$ or $X_{\block_1, 1}$ ?}
|
||||
% and the probability distribution of $\pxdb$ is uniquely determined based on a probability vector $\vct{p}$ that associates each tuple a probability
|
||||
% variables are independent of each other (or disjoint if they are from the same block) and each variable $X$ is associated with a probability $\vct{p}(X) = \pd[X = 1]$. Thus, we are dealing with polynomials $\poly(\vct{X})$ that are annotations of a tuple in the result of a query $\query$ over a BIDB $\pxdb$ where $\vct{X}$ is the set of variables that occur in annotations of tuples of $\pxdb$.
|
||||
|
@ -72,7 +66,7 @@ Let $S$ be a {\em set} of polynomials over $\vct{X}$. Then $\poly(\vct{X})\mod{S
|
|||
\end{Definition}
|
||||
For example for a set of polynomials $S=\inset{X^2-X, Y^2-Y}$, taking the polynomial $2X^2 + 3XY - 2Y^2\mod S$ yields $2X+3XY-2Y$.
|
||||
%
|
||||
\begin{Definition}\label{def:mod-set-polys}
|
||||
\begin{Definition}[$\mathcal B$, $\mathcal T$]\label{def:mod-set-polys}
|
||||
Given the set of BIDB variables $\inset{X_{i,j}}$, define
|
||||
|
||||
\setlength\parindent{0pt}
|
||||
|
@ -109,7 +103,8 @@ Given the set of BIDB variables $\inset{X_{i,j}}$, define
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
|
||||
All exponents $e > 1$ in $\smbOf{\poly(\vct{X})}$ are reduced to $e = 1$ via mod $\mathcal{T}$. Performing the modulus of $\rpoly(\vct{X})$ with $\mathcal{B}$ ensures the disjoint condition of \bi, removing monomials with lineage variables from the same block.%, (recall the constraint on tuples from the same block being disjoint in a \bi).% any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
|
||||
All exponents $e > 1$ in $\smbOf{\poly(\vct{X})}$ are reduced to $e = 1$ via mod $\mathcal{T}$. Performing the modulus of $\rpoly(\vct{X})$ with $\mathcal{B}$ ensures the disjoint condition of \bi, removing monomials with lineage variables from the same block.
|
||||
%, (recall the constraint on tuples from the same block being disjoint in a \bi).% any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
|
||||
%
|
||||
For the special case of \tis, the second step is not necessary since every block contains a single tuple.
|
||||
%Alternatively, one can think of $\rpoly$ as the \abbrSMB of $\poly(\vct{X})$ when the product operator is idempotent.
|
||||
|
|
|
@ -12,7 +12,7 @@ We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complex
|
|||
\begin{Definition}[Circuit]\label{def:circuit}
|
||||
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source nodes (in degree of $0$) consist of elements in either $\reals$ or $\vct{X}$. The internal nodes and (the single) sink node of $\circuit$ (corresponding to the result tuple $t$) have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
|
||||
|
||||
$\circuit$ additionally has the following members: \type, \vari{val}, \vari{partial}, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the node $\circuit$ (i.e. one of $\{\circplus, \circmult, \var, \tnum\}$, \val is the value stored (a constant or variable), and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input.
|
||||
$\circuit$ additionally has the following members: \type, \val, \vpartial, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the node $\circuit$ (i.e. one of $\{\circplus, \circmult, \var, \tnum\}$, \val is the value stored (a constant or variable), and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input.
|
||||
%The member \degval holds the degree of \circuit.
|
||||
When the underlying DAG is a tree (with edges pointing towards the root), we will refer to the structure as an expression tree \etree. Note that in such a case, the root of \etree is analogous to the sink of \circuit.
|
||||
\end{Definition}
|
||||
|
|
|
@ -54,7 +54,7 @@ For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is th
|
|||
|
||||
Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
|
||||
We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
|
||||
In an $\semNX$-databases, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
|
||||
In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
|
||||
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
|
||||
Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{|X|}$.
|
||||
The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
|
||||
|
@ -63,7 +63,7 @@ $\semNX$-relations are closed under $\raPlus$ (\cref{fig:nxDBSemantics}).
|
|||
|
||||
\begin{figure}
|
||||
\begin{align*}
|
||||
\evald{\project_A(\rel)}{\db}(\tup) =& \bigoplus_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
|
||||
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
|
||||
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
|
||||
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
|
||||
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
|
||||
|
@ -104,9 +104,18 @@ We focus on this problem from now on, assume an implicit result tuple, and so dr
|
|||
In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
|
||||
%
|
||||
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
|
||||
In other words, each random variable corresponds to the event of a single tuple's presence.
|
||||
%
|
||||
A \emph{\ti} is a \bi where each block contains exactly one tuple.
|
||||
\Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.
|
||||
%
|
||||
In a \bi (and by extension a \ti) $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \pd[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{
|
||||
Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_j \in b_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
|
||||
}
|
||||
Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
|
||||
We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
|
||||
$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in~\Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
||||
|
|
Loading…
Reference in New Issue